It is well known that optical interconnects are more effective than electronic
interconnects (i.e., they provide more bandwidth and speed at lower power consumption)
when the interconnection distance exceeds a few millimeters. The OTIS
optoelectronic computer provides the best of both worlds by using free-space optical
interconnects to connect distant processors and electronic interconnects for processors
that are close. The optical transpose interconnection system (OTIS) provides a fixed
and easy-to-realize optical topology; the topology of the electronic interconnect is
flexible. By using different electronic topologies, we arrive at different classes of
OTIS computers. For example, OTIS-Mesh is a class of OTIS computer in which the
electronic interconnect follows the mesh paradigm, and OTIS-Hypercube is another
class in which the hypercube topology is used to realize the electronic interconnect.

In this dissertation we describe the OTIS architecture as well as some of its
properties. Algorithms for some frequently used permutations, BPC permutations,
fundamental operations, and some applications are presented for the OTIS-Mesh
computer. Properties of the OTIS-Hypercube are also discussed, along with
algorithms for commonly used data rearrangements and BPC permutations.

CHAPTER 1
INTRODUCTION 

It is well known that when communication distances exceed a few millimeters,
optical interconnects provide speed (bandwidth) and power advantages over
electronic interconnects [7, 23]. Therefore, in the construction of very large
multiprocessor computers it is prudent to interconnect physically close processors using
electronic interconnects and to use optical interconnects for pairs of processors that
are distant. We shall assume that physically close processors are in the same physical
package (chip, wafer, board) and processors that are not physically close are in
different packages. As a result, electronic interconnects are used for intrapackage
communication while optical interconnects are used for interpackage communication.

Various combinations of interconnection networks for intrapackage (i.e., electronic)
communications and interpackage (i.e., optical) communications have been
proposed. In OTIS computers [12, 33, 58], optical interconnects are realized via a
free space optical interconnect system known as the optical transpose interconnection
system (OTIS).

In this chapter, we begin by describing the OTIS. Next, we describe the OTIS-Mesh
and OTIS-Hypercube parallel computers that result, respectively, when the
OTIS  optical  interconnect  system  is  used  for  interpackage  communication  and  a  mesh 
or  hypercube  is  used  for  intrapackage  communication.  Following  that,  we  show  that 
the  OTIS  computer  can  be  used  as  a  multistage  interconnection  network  (MIN). 
Finally,  we  provide  a  brief  description  of  the  remaining  chapters. 

Figure 1.1. Two-dimensional arrangement of L = 64 inputs when M = 4 and N = 16:
(a) $\sqrt{M} \times \sqrt{M} = 2 \times 2$ grouping of inputs; (b) the $(i, *)$ group, $0 \le i < M = 4$

1.1 Optical Transpose Interconnection System (OTIS)

The optical transpose interconnection system (OTIS) was proposed by Marsden
et al. [33]. The OTIS connects $L = MN$ inputs to $L$ outputs using free space
optics and two arrays of lenslets. The first lenslet array is a $\sqrt{M} \times \sqrt{M}$ array and
the second one is of dimension $\sqrt{N} \times \sqrt{N}$. Thus, a total of $M + N$ lenslets are used.
The $L$ inputs and outputs are arranged to form a $\sqrt{L} \times \sqrt{L}$ array. The $L$ inputs are
arranged into $\sqrt{M} \times \sqrt{M}$ groups with each group containing $N$ inputs arranged into
a $\sqrt{N} \times \sqrt{N}$ array. Figure 1.1 shows the arrangement of the $L = 64$ inputs when
$M = 4$ and $N = 16$. The $M \times N$ inputs are indexed $(i,j)$ with $0 \le i < M$ and
$0 \le j < N$. Inputs with the same $i$ value are in the same $\sqrt{N} \times \sqrt{N}$ block. The
notation $(1,*)$, for example, refers to all inputs of the form $(1,j)$, $0 \le j < N$.

In addition to using the two-dimensional notation $(i,j)$ to refer to an input, we
also use a four-dimensional notation $(i_r, i_c, j_r, j_c)$ where $(i_r, i_c)$ gives the coordinates
(row, column) of the $\sqrt{N} \times \sqrt{N}$ block that contains the input (see Figure 1.1(a)) and
$(j_r, j_c)$ gives the coordinates of the element within a block (see Figure 1.1(b)). So all
elements $(i,*)$ with $i = 0$ have $(i_r, i_c) = (0,0)$; those with $i = 1$ have $(i_r, i_c) = (0,1)$;
those with $i = 2$ have $(i_r, i_c) = (1,0)$; and those with $i = 3$ have $(i_r, i_c) = (1,1)$.
Similarly, all inputs with $j = 3$ have $(j_r, j_c) = (0,3)$, and those with $j = 12$ have
$(j_r, j_c) = (3,0)$.

The $L$ outputs are also arranged into a $\sqrt{L} \times \sqrt{L}$ array. This time, however,
the $\sqrt{L} \times \sqrt{L}$ array is composed of $\sqrt{N} \times \sqrt{N}$ blocks with each block containing $M$
outputs that are arranged as a $\sqrt{M} \times \sqrt{M}$ array. The $L = MN$ outputs are indexed
$(i,j)$ with $0 \le i < N$, $0 \le j < M$. All outputs of the form $(i,*)$ are in the same
block, block $i$. Block $i$ is in position $(i_r, i_c)$, with $i = i_r\sqrt{N} + i_c$, of the $\sqrt{N} \times \sqrt{N}$
block arrangement. Outputs of the form $(*,j)$ are in position $(j_r, j_c)$ of their block,
$j = j_r\sqrt{M} + j_c$.

In the physical realization of OTIS, the $\sqrt{L} \times \sqrt{L}$ output arrangement is
rotated 180°. We have four two-dimensional planes: the first is the $\sqrt{L} \times \sqrt{L}$ input
plane; the second is a $\sqrt{M} \times \sqrt{M}$ lenslet plane; the third is a $\sqrt{N} \times \sqrt{N}$ lenslet
plane; and the fourth is the $\sqrt{L} \times \sqrt{L}$ plane of outputs rotated 180°. When the OTIS
is viewed from the side, only the first column of each of these planes is visible. Such
a side view for the case $L = M \times N = 4 \times 16$ is shown in Figure 1.2. Notice that the
first column of the input plane consists of the inputs (0,0), (0,4), (0,8), (0,12), (2,0),
(2,4), (2,8), (2,12), which in 4D notation are (0,0,0,0), (0,0,1,0), (0,0,2,0), (0,0,3,0),
(1,0,0,0), (1,0,1,0), (1,0,2,0), (1,0,3,0). The inputs in the same row as (0,0,0,0) are
$(0,*,0,*)$; those in the same row as $(i_r, i_c, j_r, j_c)$ are $(i_r,*,j_r,*)$. The $(i_r, j_r)$ values
top to bottom are (0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3). The first column
in the output plane (after the 180° rotation) has the outputs (15,3), (15,1), (11,3),
(11,1), (7,3), (7,1), (3,3), (3,1), which in 4D notation are (3,3,1,1), (3,3,0,1), (2,3,1,1),
(2,3,0,1), (1,3,1,1), (1,3,0,1), (0,3,1,1), (0,3,0,1). The outputs in the same row as
(3,3,1,1) are $(3,*,1,*)$; those in the same row as $(i_r, i_c, j_r, j_c)$ are $(i_r,*,j_r,*)$. The
$(i_r, j_r)$ values top to bottom are (3,1), (3,0), (2,1), (2,0), (1,1), (1,0), (0,1), (0,0).

Figure 1.2. Side view of the OTIS with M = 4 and N = 16

Each lens of Figure 1.2 denotes a row of lenslets and each circle denotes a row of
inputs or outputs. The interconnection pattern defined by the given arrangement of
inputs, outputs, and lenslets connects input $(i,j) = (i_r, i_c, j_r, j_c)$ to output $(j,i) =
(j_r, j_c, i_r, i_c)$. The connection is established via an optical ray that originates at
input position $(i_r, i_c, j_r, j_c)$, goes through lenslet $(i_r, i_c)$ of the first lenslet array, then
through lenslet $(j_r, j_c)$ of the second lenslet array, and finally arrives at output
position $(j_r, j_c, i_r, i_c)$.

The basic connectivity provided by the OTIS is an optical connection between
input $(i,j)$ and output $(j,i)$, $0 \le i < M$, $0 \le j < N$.
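
The index arithmetic behind the two notations can be sketched in a few lines of Python. This is only an illustration under the assumption that M and N are perfect squares and that blocks and elements are numbered in row-major order, as in the examples above; the function names are hypothetical and do not come from the dissertation.

```python
import math

def input_to_4d(i, j, M, N):
    """4D form (i_r, i_c, j_r, j_c) of input (i, j), 0 <= i < M, 0 <= j < N."""
    root_m, root_n = math.isqrt(M), math.isqrt(N)
    # Input blocks form a sqrt(M) x sqrt(M) grid; each block is sqrt(N) x sqrt(N).
    return (i // root_m, i % root_m, j // root_n, j % root_n)

def output_to_4d(i, j, M, N):
    """4D form (i_r, i_c, j_r, j_c) of output (i, j), 0 <= i < N, 0 <= j < M."""
    root_m, root_n = math.isqrt(M), math.isqrt(N)
    # Output blocks form a sqrt(N) x sqrt(N) grid; each block is sqrt(M) x sqrt(M).
    return (i // root_n, i % root_n, j // root_m, j % root_m)

def otis_output(i, j):
    """The OTIS connects input (i, j) to output (j, i)."""
    return (j, i)
```

For M = 4 and N = 16, input (2,12) has the 4D form (1,0,3,0), and the output it is connected to, (12,2), has the 4D form (3,0,1,0), that is, $(j_r, j_c, i_r, i_c)$.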

1.2 OTIS Parallel Computers

Marsden et al. [33] have proposed several parallel computer architectures in
which OTIS is used to connect processors in different groups (packages) and an
electronic interconnection network is used to connect processors in the same group.
Since Krishnamoorthy et al. [23] have shown that bandwidth is maximized and power
consumption minimized when an $L = N^2$ processor OTIS computer is partitioned
into $N$ groups of $N$ processors each, Zane et al. [58] limit the study of OTIS parallel
computers so that each processor group (package) has $N$ processors and the parallel
computer has a total of $N$ groups (packages). Let $(i,j)$ denote processor $j$ of package
$i$, $0 \le i < N$, $0 \le j < N$. Processor $(i,j)$, $i \ne j$, is connected to processor $(j,i)$
using free space optics (i.e., OTIS). The only other connections available in an OTIS
computer are the electronic intragroup connections.

A  generic  16  processor  OTIS  computer  is  shown  in  Figure  1.3.  The  solid  boxes 
denote  processors.  Each  processor  is  labeled  (g,p)  where  g  is  the  group  index  and  p  is 
the processor index. OTIS connections are shown by arrows. Intragroup connections
are not shown.

In  an  OTIS-Mesh,  processors  in  the  same  group  are  connected  as  a  2-D  mesh 
[33,  58,  46]  (Chapter  2);  and  in  an  OTIS-Hypercube  (Chapter  7),  processors  in  the 
same  group  are  connected  using  the  hypercube  topology  [33,  58,  48].  OTIS-Mesh  of 
trees [33], OTIS-Perfect shuffle, OTIS-Cube connected cycles, etc. may be defined in
an analogous manner.

When  analyzing  algorithms  for  OTIS  architectures,  we  count  data  moves  along 
electronic  interconnects  (i.e.,  electronic  moves)  and  those  along  optical  interconnects 
(i.e., OTIS moves) separately. This allows us to later account for any differences in
the  speed  and  bandwidth  of  these  two  types  of  interconnect. 

Figure 1.3. Example of OTIS connections with 16 processors (four groups, group 0 through group 3)

1.3 Permutation Routing on OTIS Computers

Suppose we wish to rearrange the data in an $N^2$ processor OTIS computer
according to the permutation $\Pi = \Pi[0] \cdots \Pi[N^2 - 1]$. That is, data from processor
$i = gN + p$ is to be sent to processor $\Pi[i]$, $0 \le i < N^2$. We assume that the
interconnection network in each group is able to sort the data in its $N$ processors
(equivalently, it is able to perform any permutation of the data in its $N$ processors).
This assumption is certainly valid for the mesh, hypercube, perfect shuffle,
cube-connected cycles, and mesh of trees interconnections mentioned earlier.

Theorem 1.3.1 Every OTIS computer in which each group can sort can perform any
permutation $\Pi$ using at most 2 OTIS moves.

Figure 1.4. Multistage interconnection network (MIN) defined by OTIS

Proof When 2 OTIS moves are permitted, the data movement can be modeled by
a 3 stage MIN (multistage interconnection network) as in Figure 1.4. Each switch
represents a processor group which is capable of performing any $N$ input to $N$ output
permutation. The OTIS moves are represented by the connections from one stage to
the next.

The OTIS interstage connections are equivalent to the interstage connections
in a standard MIN that uses $N \times N$ switches. From MIN theory [22], we know that
when $k \times k$ switches are used, $2\log_k N^2 - 1$ stages of switches are sufficient to make
an $N^2$ input $N^2$ output network that can realize every input to output permutation.
In our case (Figure 1.4), $k = N$. Therefore, $2\log_N N^2 - 1 = 3$ stages are sufficient.
Hence 2 OTIS moves suffice to realize any permutation.

An alternative proof comes from an equivalence with the preemptive open
shop scheduling problem (POSP) [9]. In the POSP we are given $n$ jobs that are to be
scheduled on $m$ machines. Each job $i$ has $m$ tasks. The task length of the $j$th task
of job $i$ is the integer $t_{ij} \ge 0$. In a preemptive schedule of length $T$, the time interval
from 0 to $T$ is divided into slices of length 1 unit each. A time slice is divided into $m$
slots with each slot representing a unit time interval on one machine. Time slots on
each machine are labeled with a job index. The labeling is done in such a way that

(a) each job (index) $i$ is assigned to exactly $t_{ij}$ slots on machine $j$, $0 \le j < m$, and

(b) no job is assigned to two or more machines in any time slice.

$T$ is the schedule length. The objective is to find the smallest $T$ for which a schedule
exists. Gonzalez and Sahni [9] have shown that the length $T_{opt}$ of an optimal schedule is

$T_{opt} = \max\{J_{max}, M_{max}\}$

where $J_{max} = \max_i\{\sum_{j=0}^{m-1} t_{ij}\}$ (i.e., $J_{max}$ is the maximum job length) and $M_{max} =
\max_j\{\sum_{i=0}^{n-1} t_{ij}\}$ (i.e., $M_{max}$ is the maximum processing to be done by any machine).

We can transform the OTIS computer permutation routing problem into a
POSP. First, note that to realize a permutation $\Pi$ with 2 OTIS moves, we must be
able to write $\Pi$ as a sequence of permutations $\Pi_0 T \Pi_1 T \Pi_2$ where $\Pi_i$ is the permutation
realized by the switches (i.e., processor groups) in stage $i$ and $T$ denotes the OTIS
(transpose) interstage permutation. Let $(g_q, p_q)$ denote processor $p_q$ of group $g_q$ where
$q \in \{i_0, o_0, i_1, o_1, i_2, o_2\}$ ($i_0$ = input of stage 0, $o_0$ = output of stage 0, etc.). Then,
the data path is

$(g_{i_0}, p_{i_0}) \xrightarrow{\Pi_0} (g_{o_0}, p_{o_0}) \xrightarrow{T} (p_{o_0}, g_{o_0}) = (g_{i_1}, p_{i_1}) \xrightarrow{\Pi_1} (g_{o_1}, p_{o_1}) \xrightarrow{T} (p_{o_1}, g_{o_1}) = (g_{i_2}, p_{i_2}) \xrightarrow{\Pi_2} (g_{o_2}, p_{o_2})$.

We observe that to realize the permutation $\Pi$, the following must hold:

(i) Switch $i$ of stage 1 should receive exactly one data item from each switch of
stage 0, $0 \le i < N$.

(ii) Switch $i$ of stage 1 should receive exactly one data item destined for each switch
of stage 2, $0 \le i < N$.

Once we know which data items will get to switch $i$, $0 \le i < N$, we can easily
compute $\Pi_0$, $\Pi_1$, and $\Pi_2$. Therefore, it is sufficient to demonstrate the existence of an
assignment of the $N^2$ stage 0 inputs to the switches in stage 1 satisfying conditions
(i) and (ii). For this, we construct an $N$ job $N$ machine POSP instance. Job $i$
represents switch $i$ of stage 0 and machine $j$ represents switch $j$ of stage 2. The
task time $t_{ij}$ equals the number of inputs to switch $i$ of stage 0 that are destined for
switch $j$ of stage 2 (i.e., $t_{ij}$ is the number of group $i$ data that are destined for group
$j$). Since $\Pi$ is a permutation, it follows that $\sum_{j=0}^{N-1} t_{ij}$ = total number of inputs to
switch $i$ of stage 0 = $N$ and $\sum_{i=0}^{N-1} t_{ij}$ = total number of inputs destined for switch $j$
of stage 2 = $N$. Therefore, $J_{max} = M_{max} = N$ and the optimal schedule length is $N$.
Since $\sum_{i=0}^{N-1} t_{ij} = N$ and the optimal schedule length is $N$, every slot of every
machine is assigned a task in an optimal schedule. From the property of a schedule,
it follows that in each time slice all $N$ job labels occur exactly once. The $N$ labels in
slice $i$ of the schedule define the inputs that are to be assigned to switch $i$ of stage 1,
$0 \le i < N$. From properties (a) and (b) of a schedule, it follows that this assignment
satisfies the requirements (i) and (ii) for an assignment to the stage 1 switches. □
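
The assignment whose existence the proof establishes can be computed directly on a single machine. The sketch below is not the scheduling algorithm of [9]; it swaps in a simpler, standard technique, namely peeling N perfect matchings off the bipartite multigraph that has $t_{ij}$ edges between source group $i$ and destination group $j$ (each matching plays the role of one time slice). The function name and the flat indexing of processors as $gN + p$ are assumptions made for illustration.

```python
def stage1_assignment(perm, n):
    """For each source index, pick the stage-1 switch it is routed through.

    perm[g * n + p] is the destination index of the datum at processor (g, p).
    Returns a list `mid` with mid[src] in range(n).
    """
    # bucket[i][j]: sources in group i whose destination lies in group j
    bucket = [[[] for _ in range(n)] for _ in range(n)]
    for src, dst in enumerate(perm):
        bucket[src // n][dst // n].append(src)

    mid = [None] * (n * n)
    for switch in range(n):          # one "time slice" per stage-1 switch
        match = [-1] * n             # match[j] = source group paired with destination group j

        def augment(i, seen):
            # Kuhn's augmenting-path search for source group i
            for j in range(n):
                if bucket[i][j] and j not in seen:
                    seen.add(j)
                    if match[j] == -1 or augment(match[j], seen):
                        match[j] = i
                        return True
            return False

        for i in range(n):
            if not augment(i, set()):
                raise ValueError("perm is not a permutation of range(n * n)")
        for j, i in enumerate(match):
            mid[bucket[i][j].pop()] = switch
    return mid
```

Because every source group and every destination group appears exactly once in each matching, the items sent to a given stage-1 switch satisfy conditions (i) and (ii). A perfect matching always exists at each round since the remaining multigraph is regular.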

Even though every permutation $\Pi$ can be realized with just 2 OTIS moves,
it takes many more OTIS moves to compute the decomposition $\Pi = \Pi_0 T \Pi_1 T \Pi_2$.
Therefore, simulating the 3 stage MIN of Figure 1.4 does not result in an efficient
algorithm to perform permutation routing. Consequently, we have developed
customized algorithms for specific as well as generalized BPC permutations (Chapters 3
and 7). General permutations may be realized using the sorting algorithm
(Section 4.13), which uses $O(\sqrt{N})$ OTIS moves.

1.4 This Dissertation

In this dissertation, properties of the OTIS-Mesh and OTIS-Hypercube are studied.
The dissertation is organized as follows:

•  Chapter 2 deals with some fundamental properties of the OTIS-Mesh, such as its
diameter and embedding schemes.

•  OTIS-Mesh algorithms for frequently used permutations are presented in
Chapter 3, along with the algorithm for general BPC permutations.

•  Algorithms for basic operations (broadcast, prefix sum, rank, sort, and so on)
are developed in Chapter 4.

•  Chapter 5 demonstrates how matrix multiplications are performed.

•  Chapter 6 presents algorithms for some well known image processing applications.

•  Properties of the OTIS-Hypercube are studied in Chapter 7, together with
algorithms for commonly used permutations and general BPC permutations.

•  Finally, Chapter 8 summarizes the dissertation and gives directions for
further research.


CHAPTER  2 
PROPERTIES  OF  OTIS-MESH 

In an $N^2$ processor OTIS-Mesh, each group is a $\sqrt{N} \times \sqrt{N}$ mesh and there
are a total of $N$ groups. Figure 2.1 shows a 16 processor OTIS-Mesh. The processors
of groups 0 and 2 are labeled using two dimensional local mesh coordinates while the
processors in groups 1 and 3 are labeled in row-major fashion. We use the notation
$(g,p)$ to refer to processor $p$ of group $g$.

In this chapter, we first show that the diameter of the OTIS-Mesh is $4\sqrt{N} - 3$.
Then, we demonstrate how the OTIS-Mesh can simulate a 4D mesh, as well as a 2D mesh.

2.1 Diameter of the OTIS-Mesh

Let $(g_1, p_1)$ and $(g_2, p_2)$ be two OTIS-Mesh processors. The shortest path
between these two processors is of one of the following forms:

(a) The path involves only electronic moves. This is possible only when $g_1 = g_2$.

(b) The path involves an even number of optical moves. In this case the path
is of the form $(g_1, p_1) \xrightarrow{E^*} (g_1, p_1') \xrightarrow{O} (p_1', g_1) \xrightarrow{E^*} (p_1', g_1') \xrightarrow{O} (g_1', p_1') \xrightarrow{E^*} \cdots \xrightarrow{E^*} (g_2, p_2)$.
Here $E^*$ denotes a sequence (possibly empty) of electronic moves and $O$ denotes
a single OTIS move. If the number of OTIS moves is more than two, we may
compress paths of this form into the shorter path $(g_1, p_1) \xrightarrow{E^*} (g_1, p_2) \xrightarrow{O}
(p_2, g_1) \xrightarrow{E^*} (p_2, g_2) \xrightarrow{O} (g_2, p_2)$. So we may assume that the path is of the
above form with exactly two OTIS moves.

(c) The path involves an odd number of OTIS moves. In this case, it must involve
exactly one OTIS move (as otherwise it may be compressed into a shorter path
with just one OTIS move as in (b)) and may be assumed to be of the form
$(g_1, p_1) \xrightarrow{E^*} (g_1, g_2) \xrightarrow{O} (g_2, g_1) \xrightarrow{E^*} (g_2, p_2)$.

Figure 2.1. 16 Processor OTIS-Mesh

Let $d(i,j)$ be the shortest distance between processors $i$ and $j$ of a group using
a path comprised solely of electronic moves. So, $d(i,j)$ is the Manhattan distance
between the two processors of the local mesh group. Shortest paths of type (a) have
length $d(p_1, p_2)$ while those of types (b) and (c) have length $d(p_1, p_2) + d(g_1, g_2) + 2$
and $d(p_1, g_2) + d(p_2, g_1) + 1$, respectively.

From the preceding discussion we have the following theorems:

Theorem 2.1.1 The length of the shortest path between processors $(g_1, p_1)$ and $(g_2, p_2)$
is $d(p_1, p_2)$ when $g_1 = g_2$ and $\min\{d(p_1, p_2) + d(g_1, g_2) + 2,\ d(p_1, g_2) + d(p_2, g_1) + 1\}$
when $g_1 \ne g_2$.

Proof When $g_1 = g_2$, there are three possibilities for the shortest path. It may be
of types (a), (b), or (c). If it is of type (a), its length is $d(p_1, p_2)$. If it is of type (b),
its length is $d(p_1, p_2) + d(g_1, g_2) + 2 = d(p_1, p_2) + 2$. If it is of type (c), its length is
$d(p_1, g_2) + d(p_2, g_1) + 1 = d(p_1, g_1) + d(p_2, g_1) + 1 = d(p_1, g_1) + d(g_1, p_2) + 1 \ge d(p_1, p_2) + 1$.
So, the shortest path has length $d(p_1, p_2)$. When $g_1 \ne g_2$, the shortest path is
either of type (b) or (c). From our earlier development it follows that its length is
$\min\{d(p_1, p_2) + d(g_1, g_2) + 2,\ d(p_1, g_2) + d(p_2, g_1) + 1\}$. □

Theorem 2.1.2 The diameter of the OTIS-Mesh is $4\sqrt{N} - 3$.

Proof Since each group is a $\sqrt{N} \times \sqrt{N}$ mesh, $d(p_1, p_2)$, $d(p_2, g_1)$, $d(p_1, g_2)$, and
$d(g_1, g_2)$ are all less than or equal to $2(\sqrt{N} - 1)$. From Theorem 2.1.1, it follows
that no two processors are more than $4(\sqrt{N} - 1) + 1 = 4\sqrt{N} - 3$ apart. Hence, the
diameter is $\le 4\sqrt{N} - 3$. Now consider the processors $(g_1, p_1)$, $(g_2, p_2)$ such that $p_1$
is in position $(0,0)$ of its group and $p_2$ is in position $(\sqrt{N} - 1, \sqrt{N} - 1)$ (i.e., $p_1$ is the
top left processor and $p_2$ the bottom right one of its group). Let $g_1$ be 0 and $g_2$ be
$N - 1$. So, $d(p_1, p_2) = d(g_1, g_2) = d(p_1, g_2) = d(p_2, g_1) = 2(\sqrt{N} - 1)$. Hence, the distance
between $(g_1, p_1)$ and $(g_2, p_2)$ is $\min\{4(\sqrt{N} - 1) + 2,\ 4(\sqrt{N} - 1) + 1\} = 4\sqrt{N} - 3$. As a
result, the diameter of the OTIS-Mesh is exactly $4\sqrt{N} - 3$. □
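
Both theorems can be checked by brute force on small instances. The following Python sketch builds the OTIS-Mesh graph implicitly, runs a breadth-first search from every processor, and compares each distance with the closed form of Theorem 2.1.1. It assumes that N is a perfect square with $\sqrt{N} \ge 2$, that local processor indices are row-major, and that processors $(g,g)$ have no OTIS link; all function names are illustrative.

```python
from collections import deque
import math

def neighbors(g, p, n):
    """Processors one move away from (g, p) in an n*n-processor OTIS-Mesh."""
    r = math.isqrt(n)
    row, col = divmod(p, r)
    for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # electronic moves
        rr, cc = row + d_row, col + d_col
        if 0 <= rr < r and 0 <= cc < r:
            yield (g, rr * r + cc)
    if g != p:                                                # OTIS move
        yield (p, g)

def bfs_distances(source, n):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        g, p = queue.popleft()
        for nxt in neighbors(g, p, n):
            if nxt not in dist:
                dist[nxt] = dist[(g, p)] + 1
                queue.append(nxt)
    return dist

def mesh_distance(a, b, n):
    """Manhattan distance d(a, b) inside one sqrt(n) x sqrt(n) group."""
    r = math.isqrt(n)
    return abs(a // r - b // r) + abs(a % r - b % r)

def theorem_distance(g1, p1, g2, p2, n):
    """Closed form of Theorem 2.1.1."""
    d = lambda a, b: mesh_distance(a, b, n)
    if g1 == g2:
        return d(p1, p2)
    return min(d(p1, p2) + d(g1, g2) + 2, d(p1, g2) + d(p2, g1) + 1)

def check(n):
    """Compare BFS with Theorem 2.1.1 for every pair; return the BFS diameter."""
    diameter = 0
    for g1 in range(n):
        for p1 in range(n):
            for (g2, p2), steps in bfs_distances((g1, p1), n).items():
                assert steps == theorem_distance(g1, p1, g2, p2, n)
                diameter = max(diameter, steps)
    return diameter
```

Calling `check(n)` for a small perfect square such as 4, 9, or 16 lets the returned diameter be compared with $4\sqrt{N} - 3$ from Theorem 2.1.2.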

2.2   Simulation  of  a  4D  Mesh 

Zane et al. [58] have shown that the OTIS-Mesh can simulate each move of
a $\sqrt{N} \times \sqrt{N} \times \sqrt{N} \times \sqrt{N}$ four-dimensional mesh by using either a single electronic
move local to a group or using one local electronic move and two intergroup OTIS
moves. For the simulation, we must first embed the 4D mesh into the OTIS-Mesh.
The embedding is rather straightforward with processor $(i,j,k,l)$ of the 4D mesh
being identified with processor $(g,p)$ of the OTIS-Mesh. Here, $g = i\sqrt{N} + j$ and
$p = k\sqrt{N} + l$.

The mesh moves $(i,j,k \pm 1,l)$ and $(i,j,k,l \pm 1)$ can be performed with one
electronic move of the OTIS-Mesh while the moves $(i,j \pm 1,k,l)$ and $(i \pm 1,j,k,l)$
require one electronic and two optical moves. For example, the move $(i,j+1,k,l)$ may
be done by the sequence $(i,j,k,l) \xrightarrow{O} (k,l,i,j) \xrightarrow{E} (k,l,i,j+1) \xrightarrow{O} (i,j+1,k,l)$.
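
The O, E, O sequence for a move in the second dimension can be traced with a short Python sketch. It assumes N is a perfect square and uses the embedding $g = i\sqrt{N} + j$, $p = k\sqrt{N} + l$ given above; the helper names are hypothetical.

```python
import math

def embed(i, j, k, l, n):
    """OTIS-Mesh processor (g, p) that hosts 4D mesh processor (i, j, k, l)."""
    r = math.isqrt(n)
    return (i * r + j, k * r + l)

def otis(proc):
    """One OTIS move: (g, p) -> (p, g)."""
    g, p = proc
    return (p, g)

def electronic(proc, d_row, d_col, n):
    """One electronic move inside a group (no wraparound)."""
    r = math.isqrt(n)
    g, p = proc
    row, col = divmod(p, r)
    row, col = row + d_row, col + d_col
    if not (0 <= row < r and 0 <= col < r):
        raise ValueError("move leaves the group mesh")
    return (g, row * r + col)

def move_j_plus_1(proc, n):
    """Simulate the 4D move (i, j, k, l) -> (i, j+1, k, l): O, then E, then O."""
    return otis(electronic(otis(proc), 0, 1, n))
```

After the first OTIS move the pair $(i,j)$ sits in the local coordinates, so a single electronic step to the right increments $j$; the second OTIS move restores the original roles of group and processor index.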

The  above  efficient  embedding  of  a  4D  mesh  implies  that  4D  mesh  algorithms 
can  be  run  on  the  OTIS-Mesh  with  a  constant  factor  (at  most  3)  slowdown  [58]. 
Unfortunately,  the  body  of  known  4D  mesh  algorithms  is  very  small  compared  to 
that  of  2D  mesh  algorithms.  So,  it  is  desirable  to  consider  a  2D  mesh  embedding. 
Such  an  embedding  will  enable  one  to  run  2D  mesh  algorithms  on  the  OTIS-Mesh. 
Naturally,  one  would  do  this  only  for  problems  for  which  no  4D  algorithm  is  known 
or  for  which  the  known  4D  mesh  algorithms  are  not  faster  than  the  2D  algorithms. 

2.3 Simulation of a 2D Mesh

There are at least two intuitively appealing ways to embed an $N \times N$ mesh
into the OTIS-Mesh. One is the group row mapping (GRM) in which each group
of the OTIS-Mesh represents a row of the 2D mesh. The mapping of the mesh row
onto a group of OTIS processors is done in a snake-like fashion as in Figure 2.2(a).
The pair of numbers in each processor of Figure 2.2(a) gives the (row, column) index
of the mapped 2D mesh processor. The thick edges show the electronic connections
used to obtain the 2D mesh row. Notice that the assignment of rows to groups is
also done in a snake-like manner. Let $(i,j)$ denote a processor of a 2D mesh. The
move to $(i,j+1)$ (or $(i,j-1)$) can be done with one electronic move as $(i,j)$ and
$(i,j+1)$ are neighbors in a processor group. If all elements of row $i$ are to be moved
over one column, then the OTIS-Mesh would need one electronic move in case of a
MIMD mesh and 3 in case of a SIMD mesh as the row move would involve a shift
by one left, right, and down within a group. A column shift can be done with 2
additional OTIS moves as in the case of a 4D mesh embedding. GRM is particularly
nice for the matrix transpose operation. Data from processor $(i,j)$ can be moved to
processor $(j,i)$ with one OTIS and zero electronic moves.

Figure 2.2. Mapping a 4 x 4 mesh onto a 16 processor OTIS-Mesh: (a) GRM; (b)
GSM
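
A minimal Python sketch of the GRM idea follows. It assumes that groups, and processors within a group, are numbered in row-major order by position, and that the same snake order is used both for assigning mesh rows to groups and for laying a mesh row out inside a group; these numbering conventions and the function names are assumptions for illustration rather than details taken from Figure 2.2.

```python
import math

def snake(k, n):
    """Row-major position of the k-th cell when a sqrt(n) x sqrt(n) grid is
    traversed in snake-like order (odd rows run right to left)."""
    r = math.isqrt(n)
    row, col = divmod(k, r)
    if row % 2 == 1:
        col = r - 1 - col
    return row * r + col

def grm(i, j, n):
    """OTIS-Mesh processor (g, p) hosting 2D mesh processor (i, j) under GRM."""
    return (snake(i, n), snake(j, n))

def otis(proc):
    """One OTIS move: (g, p) -> (p, g)."""
    g, p = proc
    return (p, g)
```

Under these assumptions `otis(grm(i, j, n))` equals `grm(j, i, n)`, which is the single-OTIS-move transpose noted above, and consecutive columns of a mesh row land on electronically adjacent processors of a group.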

The second way to embed an $N \times N$ mesh is to use the group submesh mapping
(GSM). In this, the $N \times N$ mesh is partitioned into $N$ submeshes of size $\sqrt{N} \times \sqrt{N}$.
Each  of  these  is  mapped  in  the  natural  way  onto  a  group  of  OTIS-Mesh  processors. 
Figure  2.2(b)  shows  GSM  of  a  4  x  4  mesh.  Moving  all  elements  of  row  or  column 
i  over  by  one  is  now  considerably  more  expensive.  For  example,  a  row  shift  by  +1 
would  be  accomplished  by  the  following  data  movements  (a  boundary  processor  is 
one  on  the  right  boundary  of  a  group): 

Step 1: Shift data in non-boundary processors right by one using an electronic move.

Step 2: Perform an OTIS move on boundary processor data. So, data from $(g,p)$ move
to $(p,g)$.

Step 3: Shift the data moved in Step 2 right by one using an electronic move. Now, the
data from $(g,p)$ are in $(p,g+1)$.

Step 4: Perform an OTIS move on these data. Now data originally in $(g,p)$ are in
$(g+1,p)$.

Step 5: Shift the data left by $\sqrt{N} - 1$ using $\sqrt{N} - 1$ electronic moves. Now, the boundary
data originally in $(g,p)$ are in the processor to its right but in the next group.

The above five step process takes $\sqrt{N} + 1$ electronic and two OTIS moves. Note,
however,  that  if  each  group  is  a  wraparound  mesh  in  which  the  last  processor  of  each 
row  connects  to  the  first  and  the  bottom  processor  of  each  column  connects  to  the 
top  one,  then  row  and  column  shift  operations  become  much  simpler  as  Step  1  may 
be  eliminated  and  Step  5  replaced  by  a  right  wraparound  shift  of  1.  The  complexity 
is  now  two  electronic  and  two  OTIS  moves. 
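
The five steps for the non-wraparound case can be traced for a single datum as follows. This Python sketch assumes N is a perfect square and that GSM places mesh processor (row, column) with row $= g_x\sqrt{N} + p_x$ and column $= g_y\sqrt{N} + p_y$ on OTIS-Mesh processor $(g_x\sqrt{N} + g_y,\ p_x\sqrt{N} + p_y)$; the function names are hypothetical.

```python
import math

def gsm(row, col, n):
    """OTIS-Mesh processor (g, p) hosting N x N mesh processor (row, col) under GSM."""
    r = math.isqrt(n)
    gx, px = divmod(row, r)
    gy, py = divmod(col, r)
    return (gx * r + gy, px * r + py)

def shift_right_gsm(g, p, n):
    """Follow the datum that starts at (g, p) through the row shift by +1."""
    r = math.isqrt(n)
    if p % r != r - 1:
        return (g, p + 1)          # Step 1: non-boundary datum, one electronic move
    if g % r == r - 1:
        raise ValueError("datum is in the last column of the N x N mesh")
    g, p = p, g                    # Step 2: OTIS move, (g, p) -> (p, g)
    p += 1                         # Step 3: one electronic move to the right
    g, p = p, g                    # Step 4: OTIS move, datum is now in group g + 1
    p -= r - 1                     # Step 5: sqrt(N) - 1 electronic moves to the left
    return (g, p)
```

Counted as global steps, Step 1, Step 3, and Step 5 contribute $1 + 1 + (\sqrt{N} - 1) = \sqrt{N} + 1$ electronic moves, and Steps 2 and 4 contribute the two OTIS moves.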

GSM is also inferior on the transpose operation, which now requires $8(\sqrt{N} - 1)$
electronic and 2 OTIS moves.

Theorem 2.3.1 [46] The transpose operation of an $N \times N$ mesh requires $8(\sqrt{N} - 1)$
electronic and 2 OTIS moves when the GSM is used.

Proof Let $g_x g_y$ and $p_x p_y$ denote processor $(g,p)$ of the OTIS-Mesh. This processor
is in position $(p_x, p_y)$ of group $(g_x, g_y)$ and corresponds to processor $(g_x p_x, g_y p_y)$ of
the $N \times N$ embedded mesh. To accomplish the transpose, data are to be moved
from the $N \times N$ mesh processor $(g_x p_x, g_y p_y)$ (i.e., the OTIS-Mesh processor $(g,p) =
(g_x g_y, p_x p_y)$) to the mesh processor $(g_y p_y, g_x p_x)$ (i.e., the OTIS-Mesh processor
$(g_y g_x, p_y p_x)$). The following movements do this: $(g_x p_x, g_y p_y) \xrightarrow{E^*} (g_x p_y, g_y p_x) \xrightarrow{O}
(p_y g_x, p_x g_y) \xrightarrow{E^*} (p_y g_y, p_x g_x) \xrightarrow{O} (g_y p_y, g_x p_x)$. Once again $E^*$ denotes a sequence
of electronic moves local to a group and $O$ denotes a single OTIS move. The $E$ moves
in this case perform a transpose in a $\sqrt{N} \times \sqrt{N}$ mesh. Each of these transposes can
be done in $4(\sqrt{N} - 1)$ moves [34]. So, the above transpose method uses $8(\sqrt{N} - 1)$
electronic and 2 OTIS moves.
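
The four-phase movement can be checked mechanically. The Python sketch below treats each local transpose as a single abstract step (it does not model the $4(\sqrt{N} - 1)$ electronic moves of [34]) and assumes N is a perfect square with the GSM layout described above; the function names are illustrative.

```python
import math

def gsm(row, col, n):
    """OTIS-Mesh processor (g, p) hosting N x N mesh processor (row, col) under GSM."""
    r = math.isqrt(n)
    gx, px = divmod(row, r)
    gy, py = divmod(col, r)
    return (gx * r + gy, px * r + py)

def local_transpose(proc, n):
    """E*: transpose inside a group, local position (x, y) -> (y, x)."""
    r = math.isqrt(n)
    g, p = proc
    x, y = divmod(p, r)
    return (g, y * r + x)

def otis(proc):
    """O: one OTIS move, (g, p) -> (p, g)."""
    g, p = proc
    return (p, g)

def gsm_transpose(proc, n):
    """E*, O, E*, O applied in that order."""
    return otis(local_transpose(otis(local_transpose(proc, n)), n))
```

For every mesh position, `gsm_transpose(gsm(row, col, n), n)` should equal `gsm(col, row, n)`, which mirrors the chain of movements in the proof.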

To see that this is optimal, first note that every transpose algorithm requires
at least 2 OTIS moves. For this, pick a group $g_x g_y$ such that $g_x \ne g_y$. Data from all
$N$ processors in this group are to move to the processors in group $g_y g_x$. This requires
at least one OTIS move. However, if only one OTIS move is performed, data from
$g_x g_y$ are scattered to the $N$ groups. So, at least two OTIS moves are needed if the
data are to end up in the same group.

Next, we shall show that independent of the OTIS moves, at least $8(\sqrt{N} - 1)$
electronic moves must be performed. The electronic moves cumulatively perform one
of the following two transforms (depending on whether the number of OTIS moves
is even or odd, see previous section about the diameter):

(a) local moves from $(p_x, p_y)$ to $(p_y, p_x)$; local moves from $(g_x, g_y)$ to $(g_y, g_x)$;

(b) local moves from $(p_x, p_y)$ to $(g_y, g_x)$; local moves from $(g_x, g_y)$ to $(p_y, p_x)$.

For $(p_x, p_y) = (g_x, g_y) = (0, \sqrt{N} - 1)$, (a) and (b) require $2(\sqrt{N} - 1)$ left
and $2(\sqrt{N} - 1)$ down moves. For $(p_x, p_y) = (g_x, g_y) = (\sqrt{N} - 1, 0)$, (a) and (b)
require $2(\sqrt{N} - 1)$ right and $2(\sqrt{N} - 1)$ up moves. The total number of moves is
thus $8(\sqrt{N} - 1)$. So, $8(\sqrt{N} - 1)$ is a lower bound on the number of electronic moves
needed. □


CHAPTER  3 
DATA  REARRANGEMENT  ON  AN  OTIS-MESH 

From Section 1.3, we know that an $N^2$ processor OTIS-Mesh can realize any
permutation of $N^2$ data (one to each processor) using at most two OTIS moves.
However, additional OTIS moves are needed to determine the local group data
rearrangements that must be made.

In this chapter, we first develop algorithms to realize permutations such as
transpose, shuffle, unshuffle, and vector reversal, which arise frequently in applications.
Nassimi and Sahni [34] have developed optimal 4D mesh algorithms for several
frequently arising permutations. These may be simulated using the method of Zane et
al. [58] to obtain algorithms for the OTIS-Mesh. Table 3.1 gives the number of 4D
mesh moves used by the optimal 4D mesh algorithms, a breakdown of the number
of moves in the first two and last two dimensions, and the number of electronic and
OTIS moves required by the simulation.

In the following sections we shall obtain OTIS-Mesh algorithms for the permutations
of Table 3.1 that require far fewer moves than the simulations of the optimal
4D mesh algorithms.

Table 3.1. Optimal moves for 4D mesh and respective OTIS-Mesh simulations

                      4D mesh                                                  OTIS-Mesh simulation
Permutation           total               dim. 1 + 2          dim. 3 + 4          OTIS                electronic
Transpose             $8(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$
Perfect Shuffle       $4\sqrt{N}$         $2\sqrt{N}$         $2\sqrt{N}$         $4\sqrt{N}$         $4\sqrt{N}$
Unshuffle             $4\sqrt{N}$         $2\sqrt{N}$         $2\sqrt{N}$         $4\sqrt{N}$         $4\sqrt{N}$
Bit Reversal          $8(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$
Vector Reversal       $8(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$
Bit Shuffle           $8\sqrt{N}-4$       $4\sqrt{N}-2$       $4\sqrt{N}-2$       $8\sqrt{N}-4$       $8\sqrt{N}-4$
Shuffled Row-major    $8\sqrt{N}-4$       $4\sqrt{N}-2$       $4\sqrt{N}-2$       $8\sqrt{N}-4$       $8\sqrt{N}-4$
$G_y P_x$ Swap        $4(\sqrt{N}-1)$     $2(\sqrt{N}-1)$     $2(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$

Assume that the $N^2$ OTIS-Mesh processors are numbered/indexed 0 through
$N^2 - 1$ such that in the binary representation of a processor index the left half bits
give the group number and the right half give the processor number local to a group.
So, a processor index $I$ is of the form $I = GP$ where $I$, $G$, and $P$ are represented in
binary and $G$ and $P$ have the same number of bits. $G$ and $P$ may be decomposed
into halves to get $G = G_x G_y$ and $P = P_x P_y$ such that $G_x$ and $G_y$ give the group
location by row and column in an array layout of groups (as in Figure 2.1) and $P_x$
and $P_y$ locate processor $P$ of a group by its row and column coordinates.

The permutations of Table 3.1 are members of the BPC (bit permute
complement) class of permutations defined in Nassimi and Sahni [34]. The definition of
the BPC permutation and its relations with those permutations in Table 3.1 will be
presented in the last section, along with the development of the algorithm.

3.1  Transpose 

The transpose operation may be accomplished via a single OTIS move and
zero electronic moves. The simulation of the optimal 4D mesh algorithm, however,
takes $8(\sqrt{N} - 1)$ OTIS and $8(\sqrt{N} - 1)$ electronic moves.
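
The single-move claim can be checked with a small sketch. It assumes, as in the OTIS definition used throughout, that an OTIS move sends the datum of processor (G, P) to processor (P, G); the name `otis_move` and the list-of-lists layout `data[group][processor]` are illustrative choices, not taken from the source.

```python
def otis_move(data):
    """One OTIS move: the datum held by processor (G, P) ends up in (P, G).

    data[g][p] is the datum in processor p of group g (hypothetical layout).
    """
    n = len(data)
    # Position (g, p) receives whatever was stored at (p, g).
    return [[data[p][g] for p in range(n)] for g in range(n)]


def transpose_destination(index, n):
    """Transpose permutation on index I = GP: swap the two halves."""
    g, p = divmod(index, n)
    return p * n + g
```

With each processor initially holding its own index, one call to `otis_move` places every datum at the position `transpose_destination` prescribes, with no electronic moves involved.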

3.2    Perfect  Shuffle 

Let $G$ represent the first half of the bits in the processor index and $P$ the
second half. Let $b_{G(i)}$ and $b_{P(i)}$, respectively, denote the bits in position $G(i)$ and
$P(i)$ of $G$ and $P$. So $b_{G(p/2-1)}$ and $b_{P(p/2-1)}$ are the most significant bits of $G$ and
$P$ while $b_{G(0)}$ and $b_{P(0)}$ are the least. Let $G = b_{G(p/2-1)}G'$ and $P = b_{P(p/2-1)}P'$. A
perfect shuffle may be performed as below:

Step 1: Perform a local perfect shuffle in each group. This moves data from every
processor $GP$ to the corresponding processor $GP'b_{P(p/2-1)}$.

Step 2: This step involves processors in groups $G$ such that $b_{G(p/2-1)} = 0$ only. In
these groups, odd processors exchange data with corresponding even processors
(note that the processors exchanging data differ only in bit zero). To see the
new data arrangement, it is convenient to separate out four cases depending on
the values of $b_{G(p/2-1)}$ and $b_{P(p/2-1)}$. Steps 1 and 2 accomplish the following:

Initial     Step 1      Step 2
0G'0P'  ->  0G'P'0  ->  0G'P'1
0G'1P'  ->  0G'P'1  ->  0G'P'0
1G'0P'  ->  1G'P'0  ->  1G'P'0
1G'1P'  ->  1G'P'1  ->  1G'P'1

Step 3: Perform an OTIS move on all processors.

Step 4: Perform a local shuffle in each group. The transformations so far are given
below:

Initial     Step 1      Step 2      Step 3      Step 4
0G'0P'  ->  0G'P'0  ->  0G'P'1  ->  P'10G'  ->  P'1G'0
0G'1P'  ->  0G'P'1  ->  0G'P'0  ->  P'00G'  ->  P'0G'0
1G'0P'  ->  1G'P'0  ->  1G'P'0  ->  P'01G'  ->  P'0G'1
1G'1P'  ->  1G'P'1  ->  1G'P'1  ->  P'11G'  ->  P'1G'1

Step 5: This step involves only processors in even groups. In these groups, odd
processors exchange their data with the corresponding even processors.

Initial     Step 1      Step 2      Step 3      Step 4      Step 5
0G'0P'  ->  0G'P'0  ->  0G'P'1  ->  P'10G'  ->  P'1G'0  ->  P'1G'0
0G'1P'  ->  0G'P'1  ->  0G'P'0  ->  P'00G'  ->  P'0G'0  ->  P'0G'1
1G'0P'  ->  1G'P'0  ->  1G'P'0  ->  P'01G'  ->  P'0G'1  ->  P'0G'0
1G'1P'  ->  1G'P'1  ->  1G'P'1  ->  P'11G'  ->  P'1G'1  ->  P'1G'1
Step 6: Perform an OTIS move on all processors.

Step 7: Same as Step 5. The seven step process is shown below:

Initial     Step 1      Step 2      Step 3      Step 4      Step 5      Step 6      Step 7
0G'0P'  ->  0G'P'0  ->  0G'P'1  ->  P'10G'  ->  P'1G'0  ->  P'1G'0  ->  G'0P'1  ->  G'0P'0
0G'1P'  ->  0G'P'1  ->  0G'P'0  ->  P'00G'  ->  P'0G'0  ->  P'0G'1  ->  G'1P'0  ->  G'1P'0
1G'0P'  ->  1G'P'0  ->  1G'P'0  ->  P'01G'  ->  P'0G'1  ->  P'0G'0  ->  G'0P'0  ->  G'0P'1
1G'1P'  ->  1G'P'1  ->  1G'P'1  ->  P'11G'  ->  P'1G'1  ->  P'1G'1  ->  G'1P'1  ->  G'1P'1


The correctness of the seven step algorithm above is readily seen. From the
diagram of the data movement operations, we see that data originally in 0G'0P' end
up in G'0P'0; those in 0G'1P' end up in G'1P'0; those in 1G'0P' end up in G'0P'1;
and those in 1G'1P' end up in G'1P'1. In other words, data are moved from $GP$ to
$G'b_{P(p/2-1)}P'b_{G(p/2-1)}$, which is precisely what is to be done in a perfect shuffle.

Steps 1 and 4 perform perfect shuffles in $\sqrt{N} \times \sqrt{N}$ meshes. Each of these
can be done optimally in $2\sqrt{N}$ electronic moves using the algorithm of Nassimi and
Sahni [34]. Steps 2, 5, and 7 require exchanging data between mesh neighbors. Each
exchange moves data in opposite directions on the same link and takes two electronic
moves. Steps 3 and 6 take one OTIS move each. So, the total number of moves is only
$4\sqrt{N} + 6$ electronic and 2 OTIS. In contrast, the simulation of the optimal 4D
mesh perfect shuffle algorithm takes $4\sqrt{N}$ electronic and $4\sqrt{N}$ OTIS moves.
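
The seven steps can be traced for every processor with a short sketch. It follows one datum from its source (g, p) through the steps, assuming `h = p/2` bits per half and modelling a local perfect shuffle as a one-bit left rotation of the local index; the function names are illustrative and not from the source.

```python
def rotl(x, bits):
    """Rotate a `bits`-bit integer left by one position."""
    mask = (1 << bits) - 1
    return ((x << 1) & mask) | (x >> (bits - 1))


def seven_step_shuffle(g, p, h):
    """Follow the datum that starts in processor (g, p); h bits per half."""
    p = rotl(p, h)                   # Step 1: local perfect shuffle
    if (g >> (h - 1)) & 1 == 0:      # Step 2: groups whose top bit is 0
        p ^= 1                       #         swap odd/even neighbours
    g, p = p, g                      # Step 3: OTIS move
    p = rotl(p, h)                   # Step 4: local perfect shuffle
    if g & 1 == 0:                   # Step 5: even groups only
        p ^= 1
    g, p = p, g                      # Step 6: OTIS move
    if g & 1 == 0:                   # Step 7: same as Step 5
        p ^= 1
    return g, p


def move_counts(sqrt_n):
    """Electronic and OTIS moves as tallied in the text."""
    electronic = 2 * (2 * sqrt_n) + 3 * 2   # Steps 1, 4 and Steps 2, 5, 7
    return electronic, 2
```

For every source index the result equals a one-bit left rotation of the full p-bit index, which is the perfect shuffle; `move_counts` simply reproduces the $4\sqrt{N} + 6$ electronic and 2 OTIS tally.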

3.3  Unshuffle 

This is the inverse of a perfect shuffle and may be done by running the seven
step shuffle algorithm backward (i.e., beginning with Step 7) and replacing the local
shuffles of Steps 1 and 4 by local unshuffles. The data movement is shown below
($G = G''b_{G(0)}$, $P = P''b_{P(0)}$).

Initial     Step 7      Step 6      Step 5      Step 4      Step 3      Step 2      Step 1
G"0P"0  ->  G"0P"1  ->  P"1G"0  ->  P"1G"0  ->  P"10G"  ->  0G"P"1  ->  0G"P"0  ->  0G"0P"
G"0P"1  ->  G"0P"0  ->  P"0G"0  ->  P"0G"1  ->  P"01G"  ->  1G"P"0  ->  1G"P"0  ->  1G"0P"
G"1P"0  ->  G"1P"0  ->  P"0G"1  ->  P"0G"0  ->  P"00G"  ->  0G"P"0  ->  0G"P"1  ->  0G"1P"
G"1P"1  ->  G"1P"1  ->  P"1G"1  ->  P"1G"1  ->  P"11G"  ->  1G"P"1  ->  1G"P"1  ->  1G"1P"

The  number  of  data  moves  is  the  same  as  for  a  perfect  shuffle. 

3.4  Bit Reversal

A bit reversal can be done using one OTIS and $8(\sqrt{N} - 1)$ electronic moves as
below. Note that when the bit reversal is done by simulating the optimal 4D mesh
algorithm, $8(\sqrt{N} - 1)$ electronic and $8(\sqrt{N} - 1)$ OTIS moves are made.

Step 1: Do a local bit reversal in each group.

Step 2: Perform an OTIS move of all data.

Step 3: Do a local bit reversal in each group.

Steps 1 and 3 are done optimally in $4(\sqrt{N} - 1)$ electronic moves each using
the optimal 2D mesh bit reversal algorithm of Nassimi and Sahni [34].
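
A brief sketch of why three steps suffice: reversing the whole index $GP$ yields the reversed $P$ followed by the reversed $G$, so two local reversals around one OTIS move produce the required destination. The helper names and the parameter `h` (bits per half) are illustrative assumptions.

```python
def reverse_bits(x, bits):
    """Reverse the order of the low `bits` bits of x."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (x & 1)
        x >>= 1
    return out


def bit_reversal_route(g, p, h):
    """Follow the datum that starts in processor (g, p); h bits per half."""
    p = reverse_bits(p, h)   # Step 1: local bit reversal
    g, p = p, g              # Step 2: OTIS move
    p = reverse_bits(p, h)   # Step 3: local bit reversal
    return g, p
```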

3.5  Vector Reversal

A vector reversal can be done using $8(\sqrt{N} - 1)$ electronic and two OTIS moves.
The steps are as follows:

Step  1:  Perform  a  local  vector  reversal  in  each  group. 

Step  2:  Do  an  OTIS  move  of  all  data. 

Step  3:  Perform  a  local  vector  reversal  in  each  group. 




Step  4:  Do  an  OTIS  move  of  all  data. 

Note that Step 1 moves data from $GP$ to $G\bar{P}$ (where $\bar{P}$ is the complement of
$P$). Step 2 moves this data from $G\bar{P}$ to $\bar{P}G$. Next, Step 3 sends that data to $\bar{P}\bar{G}$
and finally Step 4 sends it to $\bar{G}\bar{P}$, completing the vector reversal. The number of data
moves is easily obtained by noting that the optimal way to perform the local vector
reversals takes $4(\sqrt{N} - 1)$ electronic moves [34].
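
The four steps can be traced as follows. This is a sketch that treats a local vector reversal in a group of `n` processors as the map p -> n - 1 - p (complementing every bit of the local index when `n` is a power of two); the function name is illustrative.

```python
def vector_reversal_route(g, p, n):
    """Follow the datum that starts in processor (g, p) of an n-group OTIS-Mesh."""
    p = n - 1 - p    # Step 1: local vector reversal (complement P)
    g, p = p, g      # Step 2: OTIS move
    p = n - 1 - p    # Step 3: local vector reversal (complement G)
    g, p = p, g      # Step 4: OTIS move
    return g, p
```

The datum that started at index $I = gn + p$ finishes at index $n^2 - 1 - I$, the reversed position.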

3.6  Bit Shuffle

Our algorithm to perform this permutation employs a $G_yP_x$ Swap permutation
in which data from processor $G_xG_yP_xP_y$ is routed to processor $G_xP_xG_yP_y$. So, let us
first see how to perform this permutation.

3.6.1  $G_yP_x$ Swap

We present two algorithms for this. The first uses $4(\sqrt{N} - 1)$ electronic and
$\log_2 N$ OTIS moves. The second uses $6(\sqrt{N} - 1)$ electronic and 2 OTIS moves. While
the second algorithm uses a larger number of moves, it is to be preferred when the
cost of an OTIS move is considerably larger than that of an electronic move.

The first algorithm performs a series of bit exchange permutations of the form
$B(i) = [B_{p-1}, \ldots, B_0]$, $0 \le i < p/4$, where

$$B_j = \begin{cases} p/2 + i, & j = p/4 + i \\ p/4 + i, & j = p/2 + i \\ j, & \text{otherwise} \end{cases}$$

The permutation $B(i)$ may be realized as below:

Step 1: Processors $GP$ with $b_{G(i)} \ne b_{P(p/4+i)}$ route their data to corresponding
processors that differ only in bit $b_{P(p/4+i)}$. This requires moving data left and right
on rows of $\sqrt{N} \times \sqrt{N}$ meshes by $2^i$ positions (in each direction).

Step  2:  Perform  an  OTIS  move. 
Step 3: The data moved in Step 1 is routed from their current processors to
corresponding processors that differ only in bit $i$. This requires data moves left and
right along rows of $\sqrt{N} \times \sqrt{N}$ meshes. The distance is $2^i$ in each direction.

Step  4:  Perform  an  OTIS  move. 

The total number of moves is $2^{i+2}$ electronic and two OTIS.

To perform a $G_yP_x$ Swap permutation, we simply perform $B(i)$ permutations
for $0 \le i < p/4$. This takes $p/2 = \log_2 N$ OTIS moves and
$\sum_{i=0}^{p/4-1} 2^{i+2} = 4(2^{p/4} - 1) = 4(\sqrt{N} - 1)$ electronic moves.
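
The bit exchange sequence can be traced with a short sketch. It assumes `q = p/4` bits per coordinate, so that $G = G_xG_y$ and $P = P_xP_y$ each occupy `2q` bits, and it charges $2^{i+2}$ electronic and two OTIS moves per exchange as in the text; the function names are illustrative.

```python
def bit_exchange(g, p, i, q):
    """B(i): exchange bit i of G (in G_y) with bit q + i of P (in P_x)."""
    moved = ((g >> i) & 1) != ((p >> (q + i)) & 1)
    if moved:
        p ^= 1 << (q + i)   # Step 1: move to the processor differing in bit q + i
    g, p = p, g             # Step 2: OTIS move
    if moved:
        p ^= 1 << i         # Step 3: the same data now changes bit i
    g, p = p, g             # Step 4: OTIS move
    return g, p


def gypx_swap(g, p, q):
    """Run B(0), ..., B(q - 1) and tally the moves charged in the text."""
    electronic = otis = 0
    for i in range(q):
        g, p = bit_exchange(g, p, i, q)
        electronic += 2 ** (i + 2)
        otis += 2
    return g, p, electronic, otis
```

For `q = 2` (so $\sqrt{N} = 4$ and $N = 16$) the tallies come to 12 electronic and 4 OTIS moves, matching $4(\sqrt{N} - 1)$ and $\log_2 N$.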

The  second  algorithm  uses  the  following  six  steps: 

Step 1: Shift data in group $G_xG_y$ up circularly by $G_y$ rows. This moves the datum
from processor $G_xG_yP_xP_y$ to processor $G_xG_y((P_x - G_y) \bmod \sqrt{N})P_y$.

Step 2: Perform an OTIS move. The datum from $G_xG_yP_xP_y$ is now in
$((P_x - G_y) \bmod \sqrt{N})P_yG_xG_y$.

Step 3: In each group, shift the data right circularly along the rows by an amount
given by the left half of the group bits. The datum originally in $G_xG_yP_xP_y$ is
now in $((P_x - G_y) \bmod \sqrt{N})P_yG_xP_x$.

Step 4: Perform an OTIS move. The datum is now in $G_xP_x((P_x - G_y) \bmod \sqrt{N})P_y$.

Step 5: Move data up circularly along columns by an amount given by the right half
of the group bits. The datum is now in $G_xP_x(-G_y \bmod \sqrt{N})P_y$.

Step 6: Reverse the order of data in each column of each group. The datum is now
in $G_xP_xG_yP_y$.
While a column or row circular shift takes $\sqrt{N}$ moves in each group, the
number of moves in each direction varies from group to group. Assuming that up
moves in one group may not be overlapped with down moves in another, Steps 1, 3,
and 5 take $2(\sqrt{N} - 1)$ electronic moves each. Step 6 may be combined with Step 5
at no extra cost. So, a total of $6(\sqrt{N} - 1)$ electronic and two OTIS moves are used.
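
The six steps of the second algorithm can be traced on coordinates. This is a sketch under one loudly stated assumption: the column reversal of Step 6 is read as the map row -> (-row) mod $\sqrt{N}$ (a reversal that leaves row 0 in place), because that is the reading under which the positions stated in Steps 5 and 6 agree. The function name and the parameter `r` (standing for $\sqrt{N}$) are illustrative.

```python
def gypx_swap_six_steps(gx, gy, px, py, r):
    """Follow the datum that starts in (gx, gy, px, py); r stands for sqrt(N)."""
    px = (px - gy) % r                 # Step 1: shift up by G_y rows
    gx, gy, px, py = px, py, gx, gy    # Step 2: OTIS move
    py = (py + gx) % r                 # Step 3: shift right by left half of group bits
    gx, gy, px, py = px, py, gx, gy    # Step 4: OTIS move
    px = (px - gy) % r                 # Step 5: shift up by right half of group bits
    px = (-px) % r                     # Step 6: assumed reading of the column reversal
    return gx, gy, px, py


def move_counts(r):
    """Steps 1, 3 and 5 at 2(sqrt(N) - 1) electronic moves each, plus two OTIS moves."""
    return 3 * 2 * (r - 1), 2
```

Under that reading every datum from $G_xG_yP_xP_y$ arrives at $G_xP_xG_yP_y$, and the tally is $6(\sqrt{N} - 1)$ electronic and 2 OTIS moves.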

3.6.2  Bit Shuffle

A  bit  shuffle  may  be  performed  following  these  steps: 

Step  1:  Perform  a  GyPx  swap. 

Step  2:  Do  a  local  bit  shuffle  in  each  group. 

Step  3:  Do  an  OTIS  move. 

Step  4:  Do  a  local  bit  shuffle  in  each  group. 

Step  5:  Do  an  OTIS  move. 

Using the $4(\sqrt{N} - 1)$ electronic and $\log_2 N$ OTIS move algorithm for the
GyPx  Swap  and  the  optimal  mesh  bit  shuffle  algorithm  of  Nassimi  and  Sahni  [34], 
the  number  of  moves  becomes  (approximately)  f  y/N  -  4  electronic  and  log2  N  +  2 
OTIS. 

3.7  Shuffled Row-Major

This  is  the  inverse  of  a  bit  shuffle  and  may  be  done  in  the  same  number  of 

moves  by  running  the  bit  shuffle  algorithm  backwards.  Of  course,  Steps  2  and  4  are 

to  be  changed  to  shuffled  row-major  operations. 

3.8  BPC Permutations

We mentioned in the beginning that the permutations of the previous sections
are members of the BPC permutation class. In this section we present the definition
of the BPC permutation, its relation to those permutations, and the algorithm to
realize it.

3.8.1  Definition

In a BPC permutation, the destination processor of each datum is given by a
rearrangement of the bits in the source processor index. For the case of our $N^2$
processor OTIS-Mesh we assume that $N$ is a power of two and so the number of bits
needed to represent a processor index is $p = \log_2 N^2 = 2\log_2 N$. A BPC permutation
of Nassimi and Sahni [34] is specified by a vector $A = [A_{p-1}, A_{p-2}, \ldots, A_0]$ where

(a) $A_i \in \{\pm 0, \pm 1, \ldots, \pm(p-1)\}$, $0 \le i < p$, and

(b) $[|A_{p-1}|, |A_{p-2}|, \ldots, |A_0|]$ is a permutation of $[0, 1, \ldots, p-1]$.

The destination for the data in any processor may be computed in the following
manner. Let $m_{p-1}m_{p-2} \ldots m_0$ be the binary representation of the processor's index.
Let $d_{p-1}d_{p-2} \ldots d_0$ be that of the destination processor's index. Then,

$$d_{|A_i|} = \begin{cases} m_i & \text{if } A_i \ge 0, \\ 1 - m_i & \text{if } A_i < 0. \end{cases}$$

In this definition, $-0$ is to be regarded as $< 0$, while $+0$ is $\ge 0$.

In a 16 processor OTIS-Mesh, the processor indices have four bits with the
first two giving the group number and the second two the local processor index. The
BPC permutation $[-0, 1, 2, -3]$ requires data from each processor $m_3m_2m_1m_0$ be
routed to processor $(1 - m_0)m_1m_2(1 - m_3)$. Table 3.2 lists the source and destination
processors of the permutation.
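
The definition translates directly into a few lines of code. In this sketch each vector entry is written as a pair `(bit, complemented)` so that $-0$ can be told apart from $+0$; that encoding and the function name are illustrative assumptions, not notation from the source.

```python
def bpc_destination(source, vector):
    """Destination index of `source` under a BPC permutation.

    vector lists A_{p-1}, ..., A_0 as (|A_i|, complemented) pairs, so the
    entry -0 is written (0, True) and +1 is written (1, False).
    """
    p = len(vector)
    dest = 0
    for k, (target, complemented) in enumerate(vector):
        i = p - 1 - k                 # vector[k] is A_i
        bit = (source >> i) & 1       # m_i
        if complemented:
            bit ^= 1                  # 1 - m_i when A_i < 0
        dest |= bit << target         # d_{|A_i|}
    return dest


# The example permutation [-0, 1, 2, -3] on a 16 processor OTIS-Mesh.
EXAMPLE = [(0, True), (1, False), (2, False), (3, True)]
destinations = [bpc_destination(s, EXAMPLE) for s in range(16)]
```

Splitting each destination with `divmod(d, 4)` gives the (G, P) pairs of the destination columns.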

The  permutation  vector  A  for  each  of  the  permutations  of  Table  3.1  is  given 

in  Table  3.3. 
Table 3.2. Source and destination of the BPC permutation [-0, 1, 2, -3] in a 16
processor OTIS-Mesh

Source                           Destination
Processor   (G,P)    Binary      Binary    (G,P)    Processor
 0          (0,0)    0000        1001      (2,1)     9
 1          (0,1)    0001        0001      (0,1)     1
 2          (0,2)    0010        1101      (3,1)    13
 3          (0,3)    0011        0101      (1,1)     5
 4          (1,0)    0100        1011      (2,3)    11
 5          (1,1)    0101        0011      (0,3)     3
 6          (1,2)    0110        1111      (3,3)    15
 7          (1,3)    0111        0111      (1,3)     7
 8          (2,0)    1000        1000      (2,0)     8
 9          (2,1)    1001        0000      (0,0)     0
10          (2,2)    1010        1100      (3,0)    12
11          (2,3)    1011        0100      (1,0)     4
12          (3,0)    1100        1010      (2,2)    10
13          (3,1)    1101        0010      (0,2)     2
14          (3,2)    1110        1110      (3,2)    14
15          (3,3)    1111        0110      (1,2)     6

Table 3.3. Permutations and their permutation vectors

Permutation          Permutation Vector
Transpose            $[p/2-1, \ldots, 0, p-1, \ldots, p/2]$
Perfect Shuffle      $[0, p-1, p-2, \ldots, 1]$
Unshuffle            $[p-2, p-3, \ldots, 0, p-1]$
Bit Reversal         $[0, 1, \ldots, p-1]$
Vector Reversal      $[-(p-1), -(p-2), \ldots, -0]$
Bit Shuffle          $[p-1, p-3, \ldots, 1, p-2, p-4, \ldots, 0]$
Shuffled Row-major   $[p-1, p/2-1, p-2, p/2-2, \ldots, p/2, 0]$
3.8.2  Algorithm

Every BPC permutation, $A$, may be realized by a sequence of bit exchange
permutations of the form $B(i,j) = [B_{p-1}, \ldots, B_0]$, $p/2 \le i < p$, $0 \le j < p/2$, and

$$B_q = \begin{cases} j, & q = i \\ i, & q = j \\ q, & \text{otherwise,} \end{cases}$$

and a BPC permutation $C = [C_{p-1}, \ldots, C_0] = \Pi_G\Pi_P$ where $|C_q| < p/2$, $0 \le q < p/2$,
and $\Pi_G$ and $\Pi_P$ involve $p/2$ bits each. Let $\Pi'_G$ be the permutation obtained from $\Pi_G$ by
subtracting $p/2$ from each entry whose absolute value exceeds $p/2 - 1$. For example,
if $\Pi_G = [-3, 5, 4]$, then $p = 6$ and $\Pi'_G = [-0, 2, 1]$.

The transpose permutation may be realized by the sequence $B(p/2 + j, j)$,
$0 \le j < p/2$; bit reversal is equivalent to the sequence $B(p - 1 - j, j)$, $0 \le j < p/2$; vector
reversal can be realized by performing no bit exchanges and using
$C = [-(p-1), -(p-2), \ldots, -0]$ ($\Pi_G = [-(p-1), -(p-2), \ldots, -p/2]$,
$\Pi_P = [-(p/2-1), \ldots, -0]$); perfect shuffle may be decomposed into $B(p/2, 0)$ and
$C = [p-2, p-3, \ldots, p/2, p-1, p/2-2, \ldots, 1, 0, p/2-1]$
($\Pi_G = [p-2, p-3, \ldots, p/2, p-1]$, $\Pi_P = [p/2-2, \ldots, 1, 0, p/2-1]$).

A  bit  exchange  permutation  B(iJ)  may  be  performed  in  2*  +  21'  electronic, 

where

$$i' = \begin{cases} i - p/2, & i < 3p/4 \\ i - 3p/4, & i \ge 3p/4; \end{cases} \qquad j' = \begin{cases} j, & j < p/4 \\ j - p/4, & j \ge p/4 \end{cases}$$

and 2 OTIS moves following a process similar to that used for $B(i)$ in Section 3.6.1.
Our  algorithm  for  general  BPC  permutations  is: 

Step 1: Decompose the BPC permutation $A$ into the bit exchange permutations
$B_1(i_1, j_1), B_2(i_2, j_2), \ldots, B_k(i_k, j_k)$ and the BPC permutation $C = \Pi_G\Pi_P$ as
above. Do this such that $i_1 > i_2 > \cdots > i_k$, and $j_1 > j_2 > \cdots > j_k$.

Step 2: If $k = 0$, do the following:

Step 2.1: Do the BPC permutation $\Pi_P$ in each group using the optimal
algorithm of Nassimi and Sahni [34].

Step 2.2: Do an OTIS move.

Step 2.3: Do the BPC permutation $\Pi'_G$ in each group using the algorithm of
Nassimi and Sahni [34].

Step 2.4: Do an OTIS move.

Step 3: If $k = p/2$, do the following:

Step 3.1: Do the BPC permutation $\Pi'_G$ in each group.

Step 3.2: Do an OTIS move.

Step 3.3: Do the BPC permutation $\Pi_P$ in each group.

Step 4: If $k \le p/4$, do the following:

Step 4.1: Perform the bit exchange permutations $B_1, \ldots, B_k$.

Step 4.2: Do Steps 2.1 through 2.4.

Step 5: If $k > p/4$, do the following:

Step 5.1: Perform a sequence of $p/2 - k$ bit exchanges involving bits other
than those in $B_1, \ldots, B_k$ in the same orderly fashion described in Step 1.
Recompute $\Pi_G$ and $\Pi_P$. Swap $\Pi_G$ and $\Pi_P$.

Step 5.2: Do Steps 3.1 through 3.3.
Consider the permutation $A = [6, 11, 3, 8, 10, 7, 0, 4, 13, 14, 2, 9, 1, 15, 5, 12]$ in
a $2^{16}$ processor OTIS-Mesh (we have omitted complements for simplicity; bit
complements can be taken care of when the local BPC permutations $\Pi_G$ and $\Pi_P$ are
performed). For this, the decomposition of Step 1 yields $B_1 = B(15,7)$, $B_2 = B(13,6)$,
$B_3 = B(10,4)$, $B_4 = B(9,2)$, and $B_5 = B(8,0)$, $\Pi_G = [13, 11, 14, 8, 10, 9, 15, 12]$,
$\Pi_P = [6, 3, 2, 7, 1, 0, 5, 4]$. Since $k = 5 > p/4 = 4$, we go to Step 5. First we
perform a sequence of bit exchanges on bits not in $B_1$ through $B_5$; i.e., $B(14,5)$,
$B(12,3)$, and $B(11,1)$. Recomputing $\Pi_G$ and $\Pi_P$, we get $\Pi_G = [6, 2, 3, 1, 5, 7, 0, 4]$
and $\Pi_P = [13, 14, 11, 9, 8, 15, 10, 12]$. Next, Steps 3.1 through 3.3 are done. The
sequence of data moves is shown below:

                (b15 b14 b13 b12 b11 b10 b9  b8  b7  b6  b5  b4  b3  b2  b1  b0)
B(14,5)    ->   (b15 b5  b13 b12 b11 b10 b9  b8  b7  b6  b14 b4  b3  b2  b1  b0)
B(12,3)    ->   (b15 b5  b13 b3  b11 b10 b9  b8  b7  b6  b14 b4  b12 b2  b1  b0)
B(11,1)    ->   (b15 b5  b13 b3  b1  b10 b9  b8  b7  b6  b14 b4  b12 b2  b11 b0)
Step 3.1   ->   (b15 b5  b13 b3  b1  b10 b9  b8  b2  b6  b7  b0  b14 b11 b4  b12)
Step 3.2   ->   (b2  b6  b7  b0  b14 b11 b4  b12 b15 b5  b13 b3  b1  b10 b9  b8)
Step 3.3   ->   (b2  b6  b7  b0  b14 b11 b4  b12 b10 b15 b1  b8  b13 b5  b3  b9)

It  can  be  verified  that  the  resulting  position  is  exactly  the  destination  that 
the  original  BPC  permutation  A  dictates. 
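
That verification can be mechanized. The sketch below tracks which source bit occupies each bit position, ignores complements as the example does, and models a local BPC permutation simply by placing every bit of the local (lower) half at its required position; the variable and function names are illustrative assumptions.

```python
A = [6, 11, 3, 8, 10, 7, 0, 4, 13, 14, 2, 9, 1, 15, 5, 12]   # A_15 ... A_0
BITS, HALF = 16, 8
dest = {BITS - 1 - k: a for k, a in enumerate(A)}             # dest[i] = A_i

# Step 1: bits that must cross between the G half (8..15) and the P half (0..7).
g_cross = [i for i in range(BITS - 1, HALF - 1, -1) if dest[i] < HALF]
p_cross = [j for j in range(HALF - 1, -1, -1) if dest[j] >= HALF]
step1_exchanges = list(zip(g_cross, p_cross))                 # B_1, ..., B_k

# Step 5.1: since k > p/4, exchange the remaining bits instead.
g_stay = [i for i in range(BITS - 1, HALF - 1, -1) if dest[i] >= HALF]
p_stay = [j for j in range(HALF - 1, -1, -1) if dest[j] < HALF]
step5_exchanges = list(zip(g_stay, p_stay))


def route(exchanges):
    """Return holds, where holds[pos] is the source bit finally at position pos."""
    holds = list(range(BITS))
    for i, j in exchanges:                      # bit exchanges B(i, j)
        holds[i], holds[j] = holds[j], holds[i]
    local = holds[:]                            # Step 3.1: local permutation
    for pos in range(HALF):
        local[dest[holds[pos]] - HALF] = holds[pos]
    holds = local[HALF:] + local[:HALF]         # Step 3.2: OTIS swaps the halves
    local = holds[:]                            # Step 3.3: local permutation
    for pos in range(HALF):
        local[dest[holds[pos]]] = holds[pos]
    return local


final = route(step5_exchanges)
```

Listing `final` from position 15 down to 0 gives the source bits of the last row of the diagram, and every bit sits at the position that $A$ assigns to it.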

The local BPC permutations determined by $\Pi_G$ and $\Pi_P$ take at most
$4(\sqrt{N} - 1)$ electronic moves each [34]; the bit exchanges cumulatively take at most
$4(\sqrt{N} - 1)$ electronic and $\log_2 N$ OTIS moves. So, the total number of moves is at most
$12(\sqrt{N} - 1)$ electronic and $\log_2 N + 2$ OTIS.

3.9  Comparison

Table 3.4 lists the complexities of the algorithms for the commonly used
permutations developed in this chapter along with the complexities of algorithms that
use the simulation method of Zane et al. [58]. It is clear that each of our algorithms
outperforms the simulation by a good margin: the simulations need a number of OTIS
moves proportional to $\sqrt{N}$, whereas our algorithms need at most $\log_2 N + 2$.

Table 3.4. Complexity Comparison of Common Data Rearrangement

                     Simulation                              Ours
Permutation          electronic          OTIS                electronic                   OTIS
Transpose            $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     0                            1
Perfect Shuffle      $4\sqrt{N}$         $4\sqrt{N}$         $4\sqrt{N}+6$                2
Unshuffle            $4\sqrt{N}$         $4\sqrt{N}$         $4\sqrt{N}+6$                2
Bit Reversal         $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$              1
Vector Reversal      $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$              2
Bit Shuffle          $8\sqrt{N}-4$       $8\sqrt{N}-4$       [coefficient illegible]$\sqrt{N}-4$   $\log_2 N + 2$
Shuffled Row-major   $8\sqrt{N}-4$       $8\sqrt{N}-4$       [coefficient illegible]$\sqrt{N}-4$   $\log_2 N + 2$


CHAPTER  4 
BASIC  OPERATIONS  ON  AN  OTIS-MESH 

In this chapter, we develop deterministic OTIS-Mesh algorithms for the basic
data operations for parallel computation that are studied in Ranka and Sahni [42],
such as broadcast, window broadcast, prefix sum, rank, shift, sort, and random access read
and write. As shown in [42], algorithms for these operations can be used to arrive
at efficient parallel algorithms for numerous applications, including image processing,
computational geometry, matrix algebra, graph theory, and so forth.

We  consider  both  the  synchronous  SIMD  and  synchronous  MIMD  models. 
In  both,  all  processors  operate  in  lock-step  fashion.  In  the  SIMD  model,  all  active 
processors  perform  the  same  operation  in  any  step  and  all  active  processors  move 
data  along  the  same  dimension  or  along  OTIS  connections.  In  the  MIMD  model, 
processors  can  perform  different  operations  in  the  same  step  and  can  move  data 
along  different  dimensions. 

4.1  Data Broadcast

Data broadcast is, perhaps, the most fundamental operation for a parallel
computer. In this operation, data that is initially in a single processor $(G, P)$ is to
be broadcast or transmitted to all $N^2$ processors of the OTIS-Mesh. Data broadcast
can be accomplished using the following three step algorithm:

Step  1:  Processor  (G,P)  broadcasts  its  data  to  all  other  processors  in  group  G. 

Step  2:  Perform  an  OTIS  move. 

Step  3:  Processor  G  of  each  group  broadcasts  the  data  within  its  group. 
Following Step 2, one processor of each group has a copy of the data, and
following Step 3 each processor of the OTIS-Mesh has a copy. In the SIMD model,
Steps 1 and 3 take $2(\sqrt{N} - 1)$ electronic moves each, and Step 2 takes one OTIS
move. The SIMD complexity is $4(\sqrt{N} - 1)$ electronic moves and 1 OTIS move, or a
total of $4\sqrt{N} - 3$ moves. Note that our algorithm is optimal because the diameter of
the OTIS-Mesh is $4\sqrt{N} - 3$ (Section 2.1). For example, if the data to be broadcast
is initially in processor $(0, 0)$, the data needs to reach processor $(N-1, N-1)$, which
is at a distance of $4\sqrt{N} - 3$. In the MIMD model, the complexity of Steps 1 and 3
depends on the value of $P = (P_x, P_y)$ and ranges from a low of approximately $\sqrt{N} - 1$
to a high of $2(\sqrt{N} - 1)$. The overall complexity is at most $4(\sqrt{N} - 1)$ electronic moves
and one OTIS move. By contrast, simulating the 4D-mesh broadcast algorithm using
the simulation method of [58] takes $4(\sqrt{N} - 1)$ electronic moves and $4(\sqrt{N} - 1)$ OTIS
moves in the SIMD model and up to this many moves in the MIMD model.

4.2  Window Broadcast

In a window broadcast, we start with data in the top left $w \times w$ submesh of a
single group $G$. Here $w$ divides $\sqrt{N}$. Following the window broadcast operation, the
initial $w \times w$ window tiles all groups; that is, the window is broadcast both within
and across groups. Our algorithm for window broadcast is:

Step 1: Do a window broadcast within group $G$.

Step 2: Perform an OTIS move.

Step 3: Do an intragroup data broadcast from processor $G$ of each group.

Step  4:  Perform  an  OTIS  move. 

Following Step 1 the initial window properly tiles group $G$ and we are left with
the task of broadcasting from group $G$ to all other groups. In Step 2, data $d(G, P)$
from $(G, P)$ is moved to $(P, G)$ for $0 \le P < N$. In Step 3, $d(G, P)$ is broadcast
to all processors $(P, i)$, $0 \le P, i < N$, and in Step 4 $d(G, P)$ is moved to $(i, P)$,
$0 \le i, P < N$.

Step 1 of our window broadcast algorithm takes $2(\sqrt{N} - w)$ electronic moves
in both the SIMD and MIMD models, and Step 3 takes $2(\sqrt{N} - 1)$ electronic moves
in the SIMD model and up to $2(\sqrt{N} - 1)$ electronic moves in the MIMD model. The
total cost is $4\sqrt{N} - 2w - 2$ electronic and 2 OTIS moves in the SIMD model and up to
this many moves in the MIMD model. A simulation of the 4D mesh window broadcast
algorithm takes the same number of electronic moves, but also takes $4(\sqrt{N} - 1)$ OTIS
moves.

4.3  Prefix Sum

The index $(G, P)$ of a processor may be transformed into a scalar $I = GN + P$
with $0 \le I < N^2$. Let $D(I)$ be the data in processor $I$, $0 \le I < N^2$. In a prefix
sum, each processor $I$ computes $S(I) = \sum_{i=0}^{I} D(i)$, $0 \le I < N^2$. A simple prefix sum
algorithm results from the following observation:

$$S(I) = SD(I) + LP(I)$$

where $SD(I)$ is the sum of $D(i)$ over all processors $i$ that are in a group smaller than
the group of $I$ and $LP(I)$ is the local prefix sum within the group of $I$. The simple
prefix sum algorithm is as follows:

Step  1:  Perform  a  local  prefix  sum  in  each  group. 

Step 2: Perform an OTIS move of the prefix sums computed in Step 1 for all
processors $(G, N - 1)$.

Step 3: Group $N - 1$ computes a modified prefix sum of the values, $A$, received in
Step 2. In this modification, processor $P$ computes $\sum_{i=0}^{P-1} A(i)$ rather than
$\sum_{i=0}^{P} A(i)$.

Step 4: Perform an OTIS move of the modified prefix sums computed in Step 3.

Step 5: Each group does a local broadcast of the modified prefix sum received by its
$N - 1$ processor.

Step  6:  Each  processor  adds  the  local  prefix  sum  computed  in  Step  1  and  the  modified 
prefix  sum  it  received  in  Step  5. 

The local prefix sums of Steps 1 and 3 take $3(\sqrt{N} - 1)$ electronic moves in
both the SIMD and MIMD models, and the local data broadcast of Step 5 takes
$2(\sqrt{N} - 1)$ electronic moves. The overall complexity is $8(\sqrt{N} - 1)$ electronic moves
and 2 OTIS moves. This can be reduced to $7(\sqrt{N} - 1)$ electronic moves and 2 OTIS
moves by deferring some of the Step 1 moves to Step 5 as below.

Step 1: In each group, compute the row prefix sums $R$.

Step 2: Column $\sqrt{N} - 1$ of each group computes the modified prefix sums of its $R$
values.

Step 3: Perform an OTIS move on the prefix sums computed in Step 2 for all
processors $(G, N - 1)$.

Step 4: Group $N - 1$ computes a modified prefix sum of the values, $A$, received in
Step 3.

Step 5: Perform an OTIS move of the modified prefix sums computed in Step 4.

Step 6: Each group broadcasts the modified prefix sum received in Step 5 along
column $\sqrt{N} - 1$ of its mesh.

Step 7: The column $\sqrt{N} - 1$ processors add the modified prefix sum received in Step
6 and the prefix sum of $R$ values computed in Step 2 minus its own $R$ value
computed in Step 1.

Step 8: The result computed by column $\sqrt{N} - 1$ processors in Step 7 is broadcast
along mesh rows.

Step 9: Each processor adds its $R$ value and the value it received in Step 8.

If we simulate the best 4D mesh prefix sum algorithm, the resulting OTIS
mesh algorithm takes $7(\sqrt{N} - 1)$ electronic and $6(\sqrt{N} - 1)$ OTIS moves.
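
The simple six step prefix sum (the first of the two algorithms in this section) can be checked at the level of values. In this sketch `d[g][p]` is the datum of processor `p` in group `g`; the two OTIS moves and the local broadcast appear only as comments because they relocate values without changing them. The function name is an illustrative assumption.

```python
from itertools import accumulate


def otis_prefix_sum(d):
    """Prefix sums over I = G*N + P following the simple six step algorithm."""
    n = len(d)
    # Step 1: local prefix sum LP in each group.
    local = [list(accumulate(row)) for row in d]
    # Step 2: OTIS move; group n-1 receives A(g) = LP of processor (g, n-1).
    a = [local[g][n - 1] for g in range(n)]
    # Step 3: modified prefix sum; processor P gets A(0) + ... + A(P-1).
    modified = [sum(a[:p]) for p in range(n)]
    # Step 4: OTIS move returns modified[g] to processor n-1 of group g.
    # Step 5: local broadcast of modified[g] within group g.
    # Step 6: every processor adds its local prefix sum and the broadcast value.
    return [[local[g][p] + modified[g] for p in range(n)] for g in range(n)]
```

Flattening the result in index order reproduces the ordinary prefix sums of the flattened input.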

4.4    Data  Sum 

In  this  operation,  each  processor  is  to  compute  the  sum  of  the  D  values  of  all 
processors.  An  optimal  SIMD  data  sum  algorithm  is  as  follows: 

Step  1:  Each  group  performs  the  data  sum. 

Step  2:  Perform  an  OTIS  move. 

Step  3:  Each  group  performs  the  data  sum. 

In the SIMD model Steps 1 and 3 take $4(\sqrt{N} - 1)$ electronic moves each, and Step 2
takes 1 OTIS move. The total cost is $8(\sqrt{N} - 1)$ electronic and 1 OTIS moves. Note
that since the distance between processors $(0, 0)$ and $(N - 1, N - 1)$ is $4(\sqrt{N} - 1)$
electronic and 1 OTIS moves and since each needs to get information from the other,
at least $8(\sqrt{N} - 1)$ electronic and 1 OTIS moves are needed (the moves needed to
send information from $(0, 0)$ to $(N - 1, N - 1)$ and those from $(N - 1, N - 1)$ to $(0, 0)$
cannot be overlapped in the SIMD model). Also, note that a simulation of the 4D
mesh data sum algorithm takes $8(\sqrt{N} - 1)$ electronic and $8(\sqrt{N} - 1)$ OTIS moves.

The MIMD complexity can be reduced by computing the group sums in the
middle processor of each group rather than in the bottom right processor. The
complexity now becomes $4(\sqrt{N} - 1)$ electronic and 1 OTIS moves when $\sqrt{N}$ is odd
and $4\sqrt{N}$ electronic and 1 OTIS moves when $\sqrt{N}$ is even. The simulation of the 4D
mesh, however, takes $4(\sqrt{N} - 1)$ electronic and $4(\sqrt{N} - 1)$ OTIS moves. Notice that
the MIMD algorithm is near optimal as the diameter of the OTIS-Mesh is $4\sqrt{N} - 3$
(Section 2.1).
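
The three step data sum can be sketched at the level of values, assuming the layout `d[group][processor]`; the function name is illustrative. After Step 1 every processor of a group holds the group sum, the OTIS move hands each group one copy of every group sum, and a second local sum yields the grand total everywhere.

```python
def data_sum(d):
    """Every processor ends with the sum of all D values."""
    n = len(d)
    # Step 1: each group performs a local data sum.
    step1 = [[sum(row)] * n for row in d]
    # Step 2: OTIS move; processor p of group g receives the sum of group p.
    step2 = [[step1[p][g] for p in range(n)] for g in range(n)]
    # Step 3: each group performs a local data sum of the received values.
    return [[sum(row)] * n for row in step2]
```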

4.5  Rank 

In the rank operation, each processor $I$ has a flag $S(I) \in \{0, 1\}$, $0 \le I < N^2$.
We are to compute the prefix sums of the processors with $S(I) = 1$. This operation
can be performed in $7(\sqrt{N} - 1)$ electronic and 2 OTIS moves using the prefix sum
algorithm of Section 4.3.

4.6  Shift 

Although  there  are  many  variations  of  the  shift  operation,  the  ones  we  believe 
are  most  useful  in  application  development  are  as  follows: 

(a) mesh row shift with zero fill: in this we shift data from processor $(G_x, G_y, P_x, P_y)$
to processor $(G_x, G_y, P_x, P_y + s)$, $-\sqrt{N} < s < \sqrt{N}$. The shift is done with zero
fill and end discard (i.e., if $P_y + s \ge \sqrt{N}$ or $P_y + s < 0$, the data from $P_y$ is
discarded).

(b) mesh column shift with zero fill: similar to (a), but along mesh column $P_x$.

(c) circular shift on a mesh row: in this we shift data from processor $(G_x, G_y, P_x, P_y)$
to processor $(G_x, G_y, P_x, (P_y + s) \bmod \sqrt{N})$.

(d) circular shift on a mesh column: similar to (c), but instead $P_x$ is used.

(e) group row shift with zero fill: similar to (a), except that $G_y$ is used in place of
$P_y$.

(f) group column shift with zero fill: similar to (e), but along group column $G_x$.

(g) circular shift on a group row: similar to (c), but with $G_y$ rather than $P_y$.

(h) circular shift on a group column: similar to (g), with $G_x$ in place of $G_y$.

Shifts  of  types  (a)  through  (d)  are  done  using  the  best  mesh  algorithms  while 
those  of  types  (e)  through  (h)  are  done  as  below: 

Step  1:  Perform  an  OTIS  move. 

Step 2: Do the shift as a $P_x$ (if originally a $G_x$ shift) or a $P_y$ (if originally a $G_y$ shift)
shift.

Step 3: Perform an OTIS move.
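
The three steps for group shifts can be sketched in Python as follows. This is an illustrative model only: the mesh is held in a dictionary keyed by $(G_x, G_y, P_x, P_y)$, `n` stands for $\sqrt{N}$, and the function names `otis_move`, `mesh_row_shift`, and `group_row_shift` are assumptions, not names from the text.

```python
from itertools import product

def otis_move(data):
    """OTIS move: the datum in processor (G, P) goes to processor (P, G)."""
    return {(px, py, gx, gy): v for (gx, gy, px, py), v in data.items()}

def mesh_row_shift(data, s, n):
    """Type (a) shift: move data from P_y to P_y + s with zero fill and end discard."""
    out = {idx: 0 for idx in product(range(n), repeat=4)}
    for (gx, gy, px, py), v in data.items():
        if 0 <= py + s < n:
            out[(gx, gy, px, py + s)] = v
    return out

def group_row_shift(data, s, n):
    """Type (e) shift along G_y: OTIS move, P_y shift, OTIS move."""
    return otis_move(mesh_row_shift(otis_move(data), s, n))
```

The first OTIS move turns the $G_y$ coordinate into a $P_y$ coordinate, so an ordinary mesh shift does the work, and the second OTIS move restores the original roles of the coordinates.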

Shifts of types (a) and (b) take $|s|$ electronic moves on the SIMD and MIMD
models; (c) and (d) take $\sqrt{N}$ electronic moves on the SIMD model and
$\max\{|s|, \sqrt{N} - |s|\}$ electronic moves on the MIMD model; (e) and (f) take $|s|$
electronic and 2 OTIS moves on both SIMD and MIMD models; and (g) and (h) take
$\sqrt{N}$ electronic and 2 OTIS moves on the SIMD model and $\max\{|s|, \sqrt{N} - |s|\}$
electronic and 2 OTIS moves on the MIMD model.

If we simulate the corresponding 4D mesh algorithms, we obtain the same
complexity for (a) through (d), but (e) and (f) take an additional $2|s| - 2$ OTIS moves,
and (g) and (h) take an additional $2\max\{|s|, \sqrt{N} - |s|\} - 2$ OTIS moves.

4.7  Data Accumulation

Each processor is to accumulate $M$, $0 \le M < \sqrt{N}$, values from its neighboring
processors along one of the four dimensions $G_x$, $G_y$, $P_x$, $P_y$. Let $D(G_x, G_y, P_x, P_y)$
be the data in processor $(G_x, G_y, P_x, P_y)$. In a data accumulation along the $G_x$
dimension (for example), each processor $(G_x, G_y, P_x, P_y)$ accumulates in an array $A$
the data values from $((G_x + i) \bmod \sqrt{N}, G_y, P_x, P_y)$, $0 \le i < M$. Specifically, we
have

$A[i] = D((G_x + i) \bmod \sqrt{N}, G_y, P_x, P_y)$

Accumulation  in  other  dimensions  is  similar. 

The  accumulation  operation  can  be  done  using  a  circular  shift  of  -M  in  the 
appropriate  dimension.  The  complexity  is  readily  obtained  from  that  for  the  circular 
shift  operation  (see  Section  4.6). 
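
A minimal sketch of accumulation along a single dimension, assuming the processors along that dimension are modeled as a Python list treated as a ring (the function name `accumulate_ring` is hypothetical). Each round records the value currently held and then performs a circular shift by $-1$, so position $g$ collects $d[(g + i) \bmod n]$ for $0 \le i < m$.

```python
def accumulate_ring(d, m):
    """For each position g on a ring, collect A[i] = d[(g + i) % n], 0 <= i < m."""
    n = len(d)
    current = list(d)                  # value currently held at each position
    acc = [[] for _ in range(n)]
    for _ in range(m):
        for g in range(n):
            acc[g].append(current[g])
        # circular shift by -1: position g receives the value held at g + 1
        current = [current[(g + 1) % n] for g in range(n)]
    return acc
```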

4.8  Consecutive Sum

The $N^2$ processor OTIS-Mesh is tiled with one-dimensional blocks of size $M$.
These blocks may align with any of the four dimensions $G_x$, $G_y$, $P_x$, and $P_y$. Each
processor has $M$ values $X[j]$, $0 \le j < M$. The $i$th processor in a block is to compute
the sum of the $X[i]$'s in that block. Specifically, processor $i$ of a block computes

$S(i) = \sum_{j=0}^{M-1} X[i](j)$, $0 \le i < M$

where $i$ and $j$ are indices relative to a block and $X[i](j)$ is the $X[i]$ value of
processor $j$.

When the one-dimensional blocks of size $M$ align with the $P_x$ or $P_y$ dimensions,
a consecutive sum can be performed by using $M$ tokens in each block to accumulate
the $M$ sums $S(i)$, $0 \le i < M$. Assume the blocks align along $P_x$. Let $p_0, p_1, \ldots, p_{M-1}$
be the $M$ processors, left-to-right, in a block. The consecutive sum algorithm works
in two phases. In the first phase, processor $M - 1$ initiates tokens $t_0, t_1, \ldots, t_{M-2}$
one by one. These tokens move leftwards. When a processor receives token $t_i$, it
adds its $X[i]$ value to it and transmits the token to the processor on its left. The
first phase operates for $M - 1$ moves and at the end of this phase, $p_i$ has token
$t_i = \sum_{j=i+1}^{M-1} X[i](j)$. The second phase is similar to the first. This time, $p_0$ initiates
the tokens $t'_{M-1}, t'_{M-2}, \ldots, t'_1$ and the tokens move rightwards. Following $M - 1$ moves,
token $t'_i$ is in processor $p_i$ and $t'_i = \sum_{j=0}^{i-1} X[i](j)$. Following phase 2, $p_i$ computes the
desired result $t_i + t'_i + X[i](i)$. The total number of moves is $2(M - 1)$.

In the MIMD model, the left and right moves can be done simultaneously,
and only $M - 1$ electronic moves are needed.
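
The two-phase token scheme can be illustrated with the following Python sketch for a single block. It is a sequential model under stated assumptions: `x[j][i]` holds the value $X[i]$ of processor $p_j$, the function name `consecutive_sum` is hypothetical, and the move count is derived from each token's launch step plus the number of hops it makes, as on the SIMD model.

```python
def consecutive_sum(x):
    """x[j][i] is the value X[i] held by processor p_j of one block.
    Returns (S, moves) with S[i] = sum of x[j][i] over all j."""
    m = len(x)
    left = [0] * m    # left[i]: leftward token on arrival at p_i (sum over j > i)
    right = [0] * m   # right[i]: rightward token on arrival at p_i (sum over j < i)
    phase1 = phase2 = 0
    # Phase 1: p_{m-1} launches tokens 0, 1, ..., m-2, one per step, moving left.
    for i in range(m - 1):
        left[i] = sum(x[j][i] for j in range(m - 1, i, -1))
        phase1 = max(phase1, i + (m - 1 - i))      # launch step + hops to p_i
    # Phase 2: p_0 launches tokens m-1, ..., 1, one per step, moving right.
    for i in range(m - 1, 0, -1):
        right[i] = sum(x[j][i] for j in range(i))
        phase2 = max(phase2, (m - 1 - i) + i)      # launch step + hops to p_i
    sums = [left[i] + right[i] + x[i][i] for i in range(m)]
    return sums, phase1 + phase2
```

Because every token arrives at its destination exactly at step $M - 1$ of its phase, the sketch reports $2(M - 1)$ moves in total.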

When the one-dimensional size $M$ blocks align with $G_x$ or $G_y$, we first do an
OTIS move; then run either a $P_x$ or $P_y$ consecutive sum algorithm; and then do an
OTIS move. The number of electronic moves is the same as for $P_x$ or $P_y$ alignment.
However, two additional OTIS moves are needed.

Simulation of the corresponding 4D mesh algorithm takes an additional $4M - 6$
OTIS moves for the case of $G_x$ or $G_y$ alignment in the SIMD model and an additional
$2M - 4$ OTIS moves in the MIMD model.

4.9  Adjacent Sum

This operation is similar to the data accumulation operation of Section 4.7
except that the $M$ accumulated values are to be summed. The operation can be done
with the same complexity as data accumulation using a similar algorithm.

4.10  Concentrate

A subset of the processors contain data. These processors have been ranked
as in Section 4.5. So the data is really a pair $(D, r)$; $D$ is the data in the processor
and $r$ is its rank. Each pair $(D, r)$ is to be moved to processor $r$, $0 \le r < b$, where
$b$ is the number of processors with data. Using the $(G, P)$ format for a processor
index, we see that $(D, r)$ is to be routed from its originating processor to processor
$(\lfloor r/N \rfloor, r \bmod N)$. We accomplish this using the steps:

Step 1: Each pair $(D, r)$ is routed to processor $r \bmod N$ within its current group.

Step 2: Perform an OTIS move.

Step 3: Each pair $(D, r)$ is routed to processor $\lfloor r/N \rfloor$ within its current group.

Step 4: Perform an OTIS move.

Theorem 4.10.1  The four step algorithm given above correctly routes every pair $(D, r)$
to processor $(\lfloor r/N \rfloor, r \bmod N)$.

Proof  Step 1 does the routing on the second coordinate. This step does not route
two pairs to the same processor provided no group has two pairs $(D_1, r_1)$, $(D_2, r_2)$
with $r_1 \bmod N = r_2 \bmod N$. Since each group has at most $N$ pairs and the ranks of
these pairs are contiguous integers, no group can have two pairs with $r_1 \bmod N =
r_2 \bmod N$. So following Step 1 each processor has at most one pair and each pair is
in the correct processor of the group, though possibly in the wrong group.

To get the pairs to their correct groups without changing the within group
index, Step 2 performs an OTIS move, which moves data from processor $(G, P)$ to
processor $(P, G)$. Now all pairs in a group have the same $r \bmod N$ value and different
$\lfloor r/N \rfloor$ values. The routing on the $\lfloor r/N \rfloor$ values, as in Step 3, routes at most one pair
to each processor. The OTIS move of Step 4, therefore, gets every pair to its correct
destination processor.  □
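
The routing argument can be checked on small instances with the following Python sketch. It is an abstract model under assumptions: processors are identified by $(G, P)$ with `n` standing for $N$ (processors per group), within-group routing is treated as a single relabeling step, and the function name `otis_concentrate` is hypothetical. The assertion inside the helper fires if two pairs are ever routed to the same processor.

```python
def otis_concentrate(selected, n):
    """selected: processor indices I = G*n + P that hold data.
    Returns {rank: (G, P)} after the four steps."""
    pairs = {}                                    # (G, P) -> rank
    for r, idx in enumerate(sorted(selected)):
        pairs[divmod(idx, n)] = r

    def route_within_group(pairs, target):
        out = {}
        for (g, _), r in pairs.items():
            dest = (g, target(r))
            assert dest not in out, "two pairs routed to one processor"
            out[dest] = r
        return out

    def otis_move(pairs):
        return {(p, g): r for (g, p), r in pairs.items()}

    pairs = route_within_group(pairs, lambda r: r % n)    # Step 1
    pairs = otis_move(pairs)                              # Step 2
    pairs = route_within_group(pairs, lambda r: r // n)   # Step 3
    pairs = otis_move(pairs)                              # Step 4
    return {r: pos for pos, r in pairs.items()}
```

After the four steps, the pair with rank $r$ sits in processor $(\lfloor r/N \rfloor, r \bmod N)$, which is `divmod(r, n)` in this model.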

In group 0, Step 1 is a concentrate localized to the group, and in the remaining
groups, Step 1 is a generalized concentrate in which the ranks have been increased
by the same amount. In all groups we may use the mesh concentrate algorithm of
Nassimi and Sahni [35] to accomplish the routing in $4(\sqrt{N}-1)$ electronic moves.
Step 3 is also a concentrate as the $\lfloor r/N \rfloor$ values of the pairs are in ascending order
from $0, 1, 2, \ldots$. So Steps 1 and 3 take $4(\sqrt{N}-1)$ electronic moves each in the SIMD
model and $2(\sqrt{N}-1)$ in the MIMD model [35]. Therefore, the overall complexity
of concentrate is $8(\sqrt{N}-1)$ electronic and 2 OTIS moves in the SIMD model and
$4(\sqrt{N}-1)$ electronic and 2 OTIS moves in the MIMD model.

We can improve the SIMD time to $7(\sqrt{N}-1)$ electronic and 2 OTIS moves
by using a better mesh concentrate algorithm than the one in Nassimi and Sahni
[35]. The new and simpler algorithm is given below for the case of a generalized
concentration on a $\sqrt{N} \times \sqrt{N}$ mesh.

Step 1: Move data that are to be in a column right of the current one rightwards to
the proper processor in the same row.

Step 2: Move data that are to be in a column left of the current one leftwards to the
proper processor in the same row.

Step 3: Move data that are to be in a smaller row upwards to the proper processor
in the same column.

Step 4: Move data that are to be in a bigger row downwards to the proper processor
in the same column.

In a concentrate operation on a square mesh the data that begin in two
processors of the same row end up in different columns as the ranks of these two data
differ by at most $\sqrt{N} - 1$. So Steps 1 and 2 do not leave two or more data in the same
processor. Steps 3 and 4 get data to the proper row and hence to the proper
processor. Note that it is possible to have up to two data items in a processor following
Step 1 and Step 3. The complexity of the above concentrate algorithm is $4(\sqrt{N}-1)$
on a SIMD mesh and $2(\sqrt{N}-1)$ on an MIMD mesh (we can overlap Steps 1 and 2
as well as Steps 3 and 4 on an MIMD mesh).
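
To see how much movement each of the four steps needs on a given instance, the following Python sketch computes the largest displacement in each direction. It assumes row-major processor indices on a `side` by `side` mesh (with `side` standing for $\sqrt{N}$), destinations taken modulo the mesh size for the generalized case, and a hypothetical function name `concentrate_moves`; it measures displacements only and does not model the step-by-step data movement.

```python
def concentrate_moves(selected, side, first_rank=0):
    """selected: row-major indices on a side x side mesh that hold data.
    Returns the largest displacement needed in each of the four steps."""
    size = side * side
    moves = {"right": 0, "left": 0, "up": 0, "down": 0}
    for k, idx in enumerate(sorted(selected)):
        row, col = divmod(idx, side)
        dest_row, dest_col = divmod((first_rank + k) % size, side)
        moves["right"] = max(moves["right"], dest_col - col)   # Step 1
        moves["left"] = max(moves["left"], col - dest_col)     # Step 2
        moves["up"] = max(moves["up"], row - dest_row)         # Step 3
        moves["down"] = max(moves["down"], dest_row - row)     # Step 4
    return moves
```

With `first_rank=0` the destination index never exceeds the source index, so the `"down"` entry is always 0; a nonzero `first_rank` models a generalized concentrate, where all four directions may be needed.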

For an ordinary concentrate in which the ranks begin at 0, Step 4 can be
omitted as no data moves down a column to a row with bigger index. So an ordinary
concentrate takes only $3(\sqrt{N}-1)$ moves. This improves the SIMD concentration
algorithm of Nassimi and Sahni [35], which takes $4(\sqrt{N}-1)$ moves to do an ordinary
concentrate.

Actually, we can show that the four step concentration algorithm just stated
is optimal for the SIMD model. Consider the ordinary concentrate instance in which
the selected elements are in processors $(0, \sqrt{N}-1)$, $(1, \sqrt{N}-2)$, $\ldots$, $(\sqrt{N}-1, 0)$.
The ranks are $0, 1, \ldots, \sqrt{N}-1$. So the data in processor $(0, \sqrt{N}-1)$ is to be moved
to processor $(0, 0)$. This requires moves that yield a net of $\sqrt{N}-1$ left moves. Also,
the data in processor $(\sqrt{N}-1, 0)$ is to be moved to processor $(0, \sqrt{N}-1)$. This
requires a net of $\sqrt{N}-1$ upward moves and $\sqrt{N}-1$ rightward moves. None of these
moves can be overlapped in the SIMD model. So every SIMD concentrate algorithm
must take at least $\sqrt{N}-1$ moves in each of the directions left, right, and up; a total
of at least $3(\sqrt{N}-1)$ moves.

For the generalized concentrate algorithm, the ranks need not start at zero.
Suppose we have two elements to concentrate. One is at processor $(0, 0)$ and has rank
$N - 1$, and the other is at processor $(\sqrt{N}-1, \sqrt{N}-1)$ and has rank $N$. The data in
$(0, 0)$ is to be moved to $(\sqrt{N}-1, \sqrt{N}-1)$ at a cost of $\sqrt{N}-1$ net right and down
moves. The data in $(\sqrt{N}-1, \sqrt{N}-1)$ is to be moved to $(0, 0)$ at a cost of $\sqrt{N}-1$
net left and up moves. So at least $4(\sqrt{N}-1)$ moves are needed.

Table 4.1. Processors with data to concentrate


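$(G_x, G_y)$                             $(P_x, P_y)$
$(0, 0)$                                 $0 \le P_x < \sqrt{N}-1$, $0 \le P_y < \sqrt{N}$
$(0, 1)$                                 $P_x = 0$, $0 \le P_y < \sqrt{N}$
$G_x = 1$, $0 \le G_y < \sqrt{N}-2$      $0 \le P_x, P_y < \sqrt{N}$
$(\sqrt{N}-1, 0)$                        $0 \le P_x < \sqrt{N}$, $P_y = 0$
$(\sqrt{N}-1, \sqrt{N}-1)$               $0 \le P_x < \sqrt{N}-1$, $0 \le P_y < \sqrt{N}$
                                         $(\sqrt{N}-1, \sqrt{N}-1)$
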

Theorem 4.10.2  The OTIS-Mesh data concentration algorithm described above is
optimal for both the SIMD and MIMD models; that is, (a) every SIMD concentration
algorithm must make $7(\sqrt{N}-1)$ electronic and 2 OTIS moves in the worst case,
and (b) every MIMD concentration algorithm must make $4(\sqrt{N}-1)$ electronic and
2 OTIS moves.

Proof  (a) Suppose that the data to be concentrated are in the processors
shown in Table 4.1. Let $a$ denote processor $(\sqrt{N}-1, \sqrt{N}-1, \sqrt{N}-1, \sqrt{N}-1)$, let
$b$ denote processor $(\sqrt{N}-1, 0, \sqrt{N}-1, 0)$, and let $c$ denote processor $(0, 1, 0, 0)$. The
ranks of $a$, $b$, and $c$ are $N^{3/2}$, $N^{3/2} - N + \sqrt{N} - 1$, and $N - \sqrt{N}$ respectively. Therefore,
following the concentration the data $D(a)$, $D(b)$, and $D(c)$ initially in processors $a$,
$b$, and $c$ will be in processors $(1, 0, 0, 0)$, $(0, \sqrt{N}-1, 0, \sqrt{N}-1)$, and $(0, 0, \sqrt{N}-1, 0)$
respectively. Figure 4.1 shows the initial and concentrated data layout for the case
when $N = 16$. The change in $G_x$, $G_y$, $P_x$, and $P_y$ values between the final and initial
locations of $D(a)$, $D(b)$, and $D(c)$ is shown in Table 4.2.

The maximum net negative change in each of $G_x$, $G_y$, $P_x$, and $P_y$ is $-(\sqrt{N}-1)$.
Since a net negative change in $G_x$ can only be overlapped with a net negative change
in $P_x$ and since $D(b)$ needs $-(\sqrt{N}-1)$ negative change in both $G_x$ and $P_x$, we must
make at least $2(\sqrt{N}-1)$ electronic moves that decrease the row index within a mesh.

Figure 4.1. Data Configuration: (a) Initial; (b) Concentrated

Table 4.2. Net change in $G_x$, $G_y$, $P_x$, and $P_y$


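data      $G_x$                 $G_y$               $P_x$               $P_y$
$D(a)$    $-(\sqrt{N}-1)+1$     $-(\sqrt{N}-1)$     $-(\sqrt{N}-1)$     $-(\sqrt{N}-1)$
$D(b)$    $-(\sqrt{N}-1)$       $+(\sqrt{N}-1)$     $-(\sqrt{N}-1)$     $+(\sqrt{N}-1)$
$D(c)$    $0$                   $0$                 $+(\sqrt{N}-1)$     $0$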

Similarly, because of $D(a)$'s requirements, at least $2(\sqrt{N}-1)$ electronic moves that
decrease the column index within a $\sqrt{N} \times \sqrt{N}$ mesh must be made. Turning our
attention to net positive changes, we see that because of $D(b)$'s requirements there
must be at least $2(\sqrt{N}-1)$ electronic moves that increase the column index. $D(c)$
requires $\sqrt{N}-1$ electronic moves that increase the row index. Since positive net
moves cannot be overlapped with negative net moves, and since net moves along $G_x$
and $P_x$ cannot be overlapped with net moves along $G_y$ and $P_y$, the concentration of
the configuration of Table 4.1 must take at least $7(\sqrt{N}-1)$ electronic moves.

In addition to $7(\sqrt{N}-1)$ electronic moves, we need at least 2 OTIS moves
to concentrate the data of Table 4.1. To see this consider the data initially in group
$(0, 1)$. These data are in group $(0, 0)$ following the concentration. At least one OTIS
move is needed to move the data out of group $(0, 1)$. A nontrivial OTIS-Mesh has
$\ge 2$ processors on a row of a $\sqrt{N} \times \sqrt{N}$ submesh. For such an OTIS-Mesh, at least
two pieces of data must move from group $(0, 1)$ to group $(0, 0)$. A single OTIS move
scatters data from group $(0, 1)$ to different groups with each datum going to a different
group. At least one additional OTIS move must be made to get the data back into
the same group. Therefore the concentration of the configuration of Table 4.1 cannot
be done with fewer than 2 OTIS moves.

(b) Consider the initial configuration of Table 4.1. Since the shortest path
between processor $b$ and its destination processor is $4(\sqrt{N}-1)$ electronic and one
OTIS move, at least that many electronic moves are made, in the worst case, by
every concentration algorithm. The reason that at least 2 OTIS moves are needed to
complete the concentration is the same as for (a).  □

4.11  Distribute

This is the inverse of the concentrate operation of Section 4.10. We start with
pairs $(D_0, d_0), \ldots, (D_q, d_q)$, $d_0 < d_1 < \cdots < d_q$, in the first $q + 1$ processors $0, 1, \ldots, q$
and are to route pair $(D_i, d_i)$ to processor $d_i$, $0 \le i \le q$. The algorithm of Section 4.10
tells us how to start with pairs $(D_i, i)$ in processor $d_i$, $0 \le i \le q$, and move them so
that $D_i$ is in $i$. By running this backwards, we can start with $D_i$ in $i$ and route it to
$d_i$. The complexity of the distribute operation is the same as that of the concentrate
operation. We have shown that the concentrate algorithm of Section 4.10 is optimal;
it follows that the distribute algorithm is also optimal.

4.12  Generalize

We start with the same initial configuration as for the distribute operation.
The objective is to have $D_i$ in all processors $j$ such that $d_i \le j < d_{i+1}$ (set $d_{q+1}$ to
$N^2 - 1$). If we simulate the 4D mesh algorithm for generalize using the simulation
strategy of Zane et al. [58], it takes $8(\sqrt{N}-1)$ electronic and $8(\sqrt{N}-1)$ OTIS moves
to perform the generalize operation on an SIMD OTIS-Mesh. We can improve this
to $8(\sqrt{N}-1)$ electronic and 2 OTIS moves if we run the generalize algorithm of
Nassimi and Sahni [35] adapted to use OTIS moves as necessary. The outer loop of
the algorithm of Nassimi and Sahni [35] examines processor index bits from $2p - 1$
to 0 where $p = \log_2 N$. So in the first $p$ iterations we are moving along bits of the $G$
index and in the last $p$ iterations along bits of the $P$ index. On an OTIS-Mesh we
would break this into two parts as below:

Step 1: Perform an OTIS move.

Step 2: Run the GENERALIZE procedure of Nassimi and Sahni [35] from bit $p - 1$
to 0, while maintaining the original index.

Step 3: Perform an OTIS move.

Step 4: Run the GENERALIZE algorithm of Nassimi and Sahni [35] from bit $p - 1$
to 0.

On an MIMD OTIS-Mesh the above algorithm takes $4(\sqrt{N}-1)$ electronic
and 2 OTIS moves.

We can reduce the SIMD complexity to $7(\sqrt{N}-1)$ electronic and 2 OTIS
moves by using a better algorithm to do the generalize operation on a 2D SIMD
mesh. This algorithm uses the same observation as used by us in Section 4.10 to
speed the 2D SIMD mesh concentrate algorithm; that is, of the four move
directions, only three are actually used. When doing a generalize on a 2D $\sqrt{N} \times \sqrt{N}$ mesh
the possible move directions for data are to increasing row indexes and to decreasing
and increasing column indexes. With this observation, the algorithm to generalize
on a 2D mesh becomes:

Step  1:  Move  data  along  columns  to  increasing  row  indexes  if  the  data  is  needed  in 
a  row  with  higher  index. 

Step  2:  Move  data  along  rows  to  increasing  column  indexes  if  the  data  is  needed  in 
a  processor  in  that  row  with  higher  column  index. 

Step 3: Move data along rows to decreasing column indexes if the data is needed in
a processor in that row with smaller column index.

The correctness of the preceding generalize algorithm can be established using
the argument of Theorem 4.10.1, and its optimality follows from Theorem 4.10.2
and the fact that the distribute operation, which is the inverse of the concentrate
operation, is a special case of the generalize operation.

The new and more efficient generalize algorithm may be used in Step 2 of the
OTIS-Mesh generalize algorithm. It cannot be used in Step 4 because the generalize
of this step requires the full capability of the code of Nassimi and Sahni [35] which
permits data movement in all four directions of a mesh.

When we use the new generalize algorithm for Step 2 of the OTIS-Mesh
generalize algorithm, we can perform a generalize on a SIMD OTIS-Mesh using $7(\sqrt{N}-1)$
electronic and 2 OTIS moves. The new algorithm is optimal for both SIMD and
MIMD models. This follows from the lower bound on a concentrate operation
established in Theorem 4.10.2 and the observation made above that the distribute
operation, which is a special case of the generalize operation, is the inverse of the
concentrate operation and so has the same lower bound.

4.13  Sorting 

As was the case for the operations considered so far, an $O(\sqrt{N})$ time algorithm
to sort can be obtained by simulating a similar complexity 4D mesh algorithm. For
sorting a 4D mesh, the algorithm of Kunde [25] is the fastest. Its simulation will sort
into snake-like row-major order using $14\sqrt{N} + o(\sqrt{N})$ electronic and $12\sqrt{N} + o(\sqrt{N})$
OTIS moves on the SIMD model and $7\sqrt{N} + o(\sqrt{N})$ electronic and $6\sqrt{N} + o(\sqrt{N})$
OTIS moves on the MIMD model. To sort into row-major order, additional moves to
reverse alternate dimensions are needed. This means that an OTIS-Mesh simulation
of Kunde's 4D mesh algorithm to sort into row-major order will take $18\sqrt{N} + o(\sqrt{N})$
electronic and $16\sqrt{N} + o(\sqrt{N})$ OTIS moves on the SIMD model. We show that
Leighton's column sort [26] can be implemented on an OTIS-Mesh to sort into
row-major order using $22\sqrt{N} + o(\sqrt{N})$ electronic and $O(N^{3/8})$ OTIS moves on the SIMD
model and $11\sqrt{N} + o(\sqrt{N})$ electronic and $O(N^{3/8})$ OTIS moves on the MIMD model.
Please note that the algorithm discussed here is deterministic. The randomized
algorithms for sorting can be found in Rajasekaran and Sahni [41].

Figure 4.2. Row-Column Transformation of Leighton's Column Sort

Our OTIS-Mesh sorting algorithm is based on Leighton's column sort [26].
This sorting algorithm sorts an $r \times s$ array, with $r \ge 2(s-1)^2$, into column-major
order using the following seven steps:

Step  1:  Sort  each  column. 

Step  2:  Perform  a  row-column  transformation. 

Step  3:  Sort  each  column. 

Step  4:  Perform  the  inverse  transformation  of  Step  2. 

Step  5:  Sort  each  column  in  alternating  order. 

Step  6:  Apply  two  steps  of  comparison-exchange  to  adjacent  rows. 

Step  7:  Sort  each  column. 
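
The seven steps can be sketched sequentially in Python as follows. This is an illustration under stated assumptions, not the OTIS-Mesh implementation: the array is held as a list of $s$ columns of length $r$ with $s$ dividing $r$, the transformation of Step 2 is taken to pick items up in column-major order and set them down in row-major order, Step 6 is read as element-wise compare-exchange between neighboring columns (pairs $(0,1), (2,3), \ldots$ and then $(1,2), (3,4), \ldots$), and the function name `column_sort` is hypothetical.

```python
def column_sort(cols):
    """Sort an r x s array, given as s columns of length r, into column-major
    order. Requires r % s == 0 and r >= 2 * (s - 1) ** 2."""
    s, r = len(cols), len(cols[0])
    assert r % s == 0 and r >= 2 * (s - 1) ** 2

    def transform(c):             # pick up column-major, set down row-major
        flat = [v for col in c for v in col]
        return [[flat[row * s + j] for row in range(r)] for j in range(s)]

    def inverse_transform(c):     # pick up row-major, set down column-major
        flat = [c[j][row] for row in range(r) for j in range(s)]
        return [flat[j * r:(j + 1) * r] for j in range(s)]

    cols = [sorted(col) for col in cols]                     # Step 1
    cols = transform(cols)                                   # Step 2
    cols = [sorted(col) for col in cols]                     # Step 3
    cols = inverse_transform(cols)                           # Step 4
    cols = [sorted(col, reverse=(j % 2 == 1))                # Step 5
            for j, col in enumerate(cols)]
    for start in (0, 1):                                     # Step 6
        for j in range(start, s - 1, 2):
            for i in range(r):
                lo, hi = sorted((cols[j][i], cols[j + 1][i]))
                cols[j][i], cols[j + 1][i] = lo, hi
    return [sorted(col) for col in cols]                     # Step 7
```

Reading the returned columns one after another gives the input values in sorted order.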

Figure 4.2 shows an example of the transformation of Step 2, and its inverse.
Figure 4.3 shows a step by step example of Leighton's column sort.

Figure 4.3. Example of Leighton's Column Sort

Although Leighton's column sort is explicitly stated for $r \times s$ arrays with
$r \ge 2(s-1)^2$, it can be used to sort arrays with $s \ge 2(r-1)^2$ into row-major
order by interchanging the roles of rows and columns. We shall do this and use
Leighton's method to sort an $N^{1/2} \times N^{3/2}$ array. We interpret our $N^2$ processor
OTIS-Mesh as an $N^{1/2} \times N^{3/2}$ array with $G_x$ giving the row index and $G_y P_x P_y$ giving the
column index of an element processor. We shall further subdivide $G_x$ ($G_y$, $P_x$, $P_y$) into
equal parts $G_{x_1}$, $G_{x_2}$, $G_{x_3}$, and $G_{x_4}$ from left to right. We use $G_{x_{2-4}}$, for example,
to represent $G_{x_2} G_{x_3} G_{x_4}$. Since $p = \log_2 N$, $G_x$ has $p/2$ bits and $G_{x_1}$ has $p/8$ bits.
These notations are helpful in describing the transformations in Steps 2 and 4 of the
column sort, as we use the BPC permutations of Nassimi and Sahni [34] to realize
these transformations. The definition of a BPC permutation can be found in Nassimi
and Sahni [34] and in Section 3.8.1.

In describing our sorting algorithm, we shall, at times, use a 4D array
interpretation of an OTIS-Mesh. In this interpretation, processor $(G_x, G_y, P_x, P_y)$ of the
OTIS-Mesh corresponds to processor $(G_x, G_y, P_x, P_y)$ of the 4D mesh. We use $g_x$ to
denote the bit positions of $G_x$, that is the leftmost $p/2$ bits in a processor index, $g_{x_1}$
to represent the leftmost $p/8$ bit positions, $p_y$ to represent the rightmost $p/2$ bit
positions, $p_{y_{3-4}}$ to represent the rightmost $p/4$ bit positions, and so on. Our strategy for
the sorting steps 1, 3, 5, and 7 of Leighton's method is to collect each row (recall that
since we are sorting an $N^{1/2} \times N^{3/2}$ array, the column-sort steps of Leighton's method
become row-sort steps) of our $N^{1/2} \times N^{3/2}$ array into an $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$ 4D
submesh of the OTIS-Mesh, and then sort this row by simulating the 4D mesh sort
algorithm of Kunde [25]. This strategy translates into the following sorting algorithm:

Step 1: [Move rows of the $N^{1/2} \times N^{3/2}$ array into $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$ 4D
submeshes]
Perform the BPC permutation $P_a = [g_{x_1} g_{y_1} p_{x_1} p_{y_1} g_{x_2} g_{y_{2-4}} g_{x_3} p_{x_{2-4}} g_{x_4} p_{y_{2-4}}]$.

Step 2: [Sort each row of the $N^{1/2} \times N^{3/2}$ array]
Sort each 4D submesh of size $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$.

Step 3: [Do the inverse of Step 1, perform a column-row transformation, and move
rows into $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$ submeshes]
Perform the BPC permutation $P_c = [g_{x_{2-4}} g_{x_1} g_{y_2} p_{x_{2-4}} g_{y_3} p_{y_{2-4}} g_{y_4} g_{y_1} p_{x_1} p_{y_1}]$.

Step 4: [Sort each row of the $N^{1/2} \times N^{3/2}$ array]
Sort each 4D submesh of size $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$.

Step 5: [Do the inverse of Step 1, perform a row-column transformation, and move
rows into $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$ submeshes]
Perform the BPC permutation $P_c' = [g_{x_4} g_{x_{1-3}} p_{y_2} g_{y_1} p_{x_1} p_{y_1} p_{y_3} g_{y_{2-4}} p_{y_4} p_{x_{2-4}}]$.

Step 6: [Sort each row in alternating order]
Sort each 4D submesh of size $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$.

Step 7: [Move rows back from 4D submeshes]
Perform the BPC permutation $P_a' = [g_{x_1} g_{y_1} p_{x_1} p_{y_1} g_{x_2} g_{y_{2-4}} g_{x_3} p_{x_{2-4}} g_{x_4} p_{y_{2-4}}]$.

Step 8: Apply two steps of comparison-exchange to adjacent rows.

Step 9: [Move rows into submeshes of size $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$]
Perform the BPC permutation $P_a = [g_{x_1} g_{y_1} p_{x_1} p_{y_1} g_{x_2} g_{y_{2-4}} g_{x_3} p_{x_{2-4}} g_{x_4} p_{y_{2-4}}]$.

Step 10: [Sort each row of the $N^{1/2} \times N^{3/2}$ array]
Sort each 4D submesh of size $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$.

Step 11: [Move rows back from 4D submeshes]
Perform the BPC permutation $P_a' = [g_{x_1} g_{y_1} p_{x_1} p_{y_1} g_{x_2} g_{y_{2-4}} g_{x_3} p_{x_{2-4}} g_{x_4} p_{y_{2-4}}]$.

Notice that the row to 4D submesh transform is accomplished by the BPC
permutation $P_a = [g_{x_1} g_{y_1} p_{x_1} p_{y_1} g_{x_2} g_{y_{2-4}} g_{x_3} p_{x_{2-4}} g_{x_4} p_{y_{2-4}}]$. Elements in the same row of
our $N^{1/2} \times N^{3/2}$ array interpretation have the same $G_x$ value; but in our 4D mesh
interpretation, elements in the same $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$ submesh have the
same $G_{x_1} G_{y_1} P_{x_1} P_{y_1}$ value. $P_a$ results in this property. To go from Step 2 to Step
3 of Leighton's method, we need to first restore the $N^{1/2} \times N^{3/2}$ array
interpretation using the inverse permutation of $P_a$, that is, perform the BPC permutation
$P_a' = [g_{x_1} g_{y_1} p_{x_1} p_{y_1} g_{x_2} g_{y_{2-4}} g_{x_3} p_{x_{2-4}} g_{x_4} p_{y_{2-4}}]$; then perform a column-row transform
using BPC permutation $P_b = [g_y p_x p_y g_x]$; and finally map the rows of our $N^{1/2} \times N^{3/2}$
array into 4D submeshes of size $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$ using the BPC permutation
$P_a$. The three BPC permutation sequence $P_a' P_b P_a$ is equivalent to the single BPC
permutation $P_c = [g_{x_{2-4}} g_{x_1} g_{y_2} p_{x_{2-4}} g_{y_3} p_{y_{2-4}} g_{y_4} g_{y_1} p_{x_1} p_{y_1}]$.

The preceding OTIS-Mesh implementation of column sort performs 6 BPC
permutations, 4 4D mesh sorts, and two steps of comparison-exchange on adjacent
rows. Since the sorting steps take $O(N^{3/8})$ time each (use Kunde's 4D mesh sort
[25] followed by a transform from snake-like row-major to row-major), and since the
remaining steps take $O(N^{1/2})$ time, we shall ignore the complexity of the sort steps.

We can reduce the number of BPC permutations from 6 to 3 as follows. First
note that the $P_a$ of Step 1 just moves elements from rows of the $N^{1/2} \times N^{3/2}$ array
into $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$ 4D submeshes. For the sort of Step 2, it doesn't really
matter which $N^{3/2}$ elements go to each 4D submesh as the initial configuration is an
arbitrary unsorted configuration. So we may eliminate Step 1 altogether. Next note
that the BPC permutations of Steps 7 and 9 cancel each other and we can perform the
comparison-exchange of Step 8 by moving data from one $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$
4D submesh to an adjacent one and back in $O(N^{3/8})$ time.

With  these  observations,  the  algorithm  to  sort  on  an  OTIS-Mesh  becomes: 

Step 1: Sort in each subarray of size $N^{3/8} \times N^{3/8} \times N^{3/8} \times N^{3/8}$.

Step 2: Perform the BPC permutation $P_c$.

Step 3: Sort in each subarray.

Step 4: Perform the BPC permutation $P_c'$.

Step 5: Sort in each subarray.

Step 6: Apply two steps of comparison-exchange to adjacent subarrays.

Step 7: Sort in each subarray.

Step 8: Perform the BPC permutation $P_a'$.

Using the BPC routing algorithm of Section 3.8.2, the three BPC
permutations can be done using $36\sqrt{N}$ electronic and $3\log_2 N + 6$ OTIS moves on the
SIMD model and $18\sqrt{N}$ electronic and $3\log_2 N + 6$ OTIS moves on the MIMD
model. A more careful analysis based on the development in Nassimi and Sahni
[34] and Section 3.8.2 reveals that the permutations $P_a'$, $P_c$, and $P_c'$ can be done
with $28\sqrt{N}$ electronic and $3\log_2 N + 6$ OTIS moves on the SIMD model and $14\sqrt{N}$
electronic and $3\log_2 N + 6$ OTIS moves on the MIMD model. By using
$P_a' = [g_{x_1} g_{y_1} p_{x_1} p_{y_1} g_{x_2} g_{y_2} p_{x_2} p_{y_2} g_{x_3} g_{y_3} p_{x_3} p_{y_3} g_{x_4} g_{y_4} p_{x_4} p_{y_4}]$,
$P_c = [g_{x_{2-4}} g_{x_1} g_{y_{2-4}} g_{y_1} p_{x_{2-4}} p_{x_1} p_{y_{2-4}} p_{y_1}]$,
and $P_c' = [g_{x_4} g_{x_{1-3}} g_{y_4} g_{y_{1-3}} p_{x_4} p_{x_{1-3}} p_{y_4} p_{y_{1-3}}]$, the permutation cost becomes $22\sqrt{N}$
electronic and $\log_2 N + 5$ OTIS moves on the SIMD model and $11\sqrt{N}$ electronic
and $\log_2 N + 5$ OTIS moves on the MIMD model. The total number of moves is
thus $22\sqrt{N} + O(N^{3/8})$ electronic and $O(N^{3/8})$ OTIS moves on the SIMD model and
$11\sqrt{N} + O(N^{3/8})$ electronic and $O(N^{3/8})$ OTIS moves on the MIMD model. This
is superior to the cost of the sorting algorithm that results from simulating the 4D
row-major mesh sort of Kunde [25].

4.14  Random Access Read (RAR)

In a random access read (RAR) [42] processor $I$ wishes to read data variable
$D$ of processor $d_I$, $0 \le I < N^2$. The steps suggested in Ranka and Sahni [42] for this
operation are as follows:

Step  0:  Processor  /  creates  a  triple  (/,  D,dj)  where  D  is  initially  empty. 
Step  1:  Sort  the  triples  by  d/. 

Step  2:  Processor  /  checks  processor  /  +  1  and  deactivates  if  both  have  triples  with 
the  same  third  coordinate. 

Step  3:  Rank  the  remaining  processors. 
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Step  4:  Concentrate  the  triples  using  the  ranks  of  Step  3. 

Step  5:  Distribute  the  triples  according  to  their  third  coordinates. 

Step  6:  Load  each  triple  with  the  D  value  of  the  processor  it  is  in. 

Step  7:  Concentrate  the  triples  using  the  ranks  in  Step  3. 

Step  8:  Generalize  the  triples  to  get  the  configuration  we  had  following  Step  1. 

Step  9:  Sort  the  triples  by  their  first  coordinates. 
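The ten steps above can be sketched sequentially to make their net effect concrete. The following Python fragment is only an illustration under simplifying assumptions: the function name `random_access_read` and the list-based representation are hypothetical, the parallel sort, rank, concentrate, distribute, and generalize operations are collapsed into ordinary list operations, and no attempt is made to model electronic or OTIS move counts.

```python
def random_access_read(data, wanted):
    """Return, for each processor I, the value data[wanted[I]]."""
    n = len(data)
    # Step 0: processor I creates the triple (I, D, d_I) with D initially empty.
    triples = [(i, None, wanted[i]) for i in range(n)]
    # Step 1: sort the triples by their third coordinate.
    triples.sort(key=lambda t: t[2])
    # Step 2: position p deactivates when position p + 1 holds the same target,
    # so only the last triple of each run stays active.
    active = [p == n - 1 or triples[p][2] != triples[p + 1][2] for p in range(n)]
    # Steps 3-4: rank the active positions and concentrate their triples.
    concentrated = [triples[p] for p in range(n) if active[p]]
    # Steps 5-6: distribute each triple to its target processor and load D there.
    loaded = {target: data[target] for (_src, _d, target) in concentrated}
    # Steps 7-8: concentrate again, then generalize so that every triple of a run
    # receives the value fetched by the run's active triple.
    filled = [(src, loaded[target], target) for (src, _d, target) in triples]
    # Step 9: sort by the first coordinate to return values to their requesters.
    filled.sort(key=lambda t: t[0])
    return [value for (_src, value, _target) in filled]
```

For example, with `data = [10, 20, 30, 40]` and `wanted = [3, 3, 0, 1]`, processors 0 and 1 both read from processor 3; only one of their triples travels to processor 3, and the generalize step copies the fetched value to the other.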

Using the SIMD model, the RAR algorithm of Ranka and Sahni [42] takes
$79(\sqrt{N} - 1)$ electronic moves and $O(N^{3/8})$ OTIS moves.  On the MIMD model, it
takes $45(\sqrt{N} - 1)$ electronic and $O(N^{3/8})$ OTIS moves.

4.15 Random Access Write (RAW)

Now processor $I$ wants to write its $D$ data to processor $d_I$, $0 \le I < N^2$.  The
steps in the RAW algorithm of Ranka and Sahni [42] are as follows:

Step 0: Processor $I$ creates the tuple $(D(I), d_I)$, $0 \le I < N^2$.

Step 1: Sort the tuples by their second coordinates.

Step 2: Processor $I$ deactivates if the second coordinate of its tuple is the same as
the second coordinate of the tuple in $I + 1$, $0 \le I < N^2 - 1$.

Step 3: Rank the remaining processors.

Step 4: Concentrate the tuples using the ranks of Step 3.

Step 5: Distribute the tuples according to their second coordinates.

Step 2 implements the arbitrary write method for a concurrent write.  In
this, any one of the processors wishing to write to the same location is permitted
to succeed.  The priority model may be implemented by sorting in Step 1 by $d_I$ and
within $d_I$ by priority.  The common and combined models can also be implemented,
but with increased complexity.
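A sequential sketch of the RAW steps under the arbitrary write model is given below. It is only illustrative: the name `random_access_write` and its list arguments are hypothetical, the parallel primitives are replaced by ordinary list operations, and the fact that Python's stable sort makes the highest-numbered writer win is an artifact of this sketch rather than a property of the model.

```python
def random_access_write(memory, data, dest):
    """Write data[I] into memory[dest[I]]; one writer per destination succeeds."""
    n = len(data)
    # Step 0: processor I creates the tuple (D(I), d_I).
    tuples = [(data[i], dest[i]) for i in range(n)]
    # Step 1: sort the tuples by their second coordinates.
    tuples.sort(key=lambda t: t[1])
    # Step 2: position p deactivates when position p + 1 has the same destination,
    # which leaves exactly one active tuple per destination (arbitrary write).
    survivors = [t for p, t in enumerate(tuples)
                 if p == n - 1 or t[1] != tuples[p + 1][1]]
    # Steps 3-5: rank, concentrate, and distribute the surviving tuples.
    result = list(memory)
    for value, target in survivors:
        result[target] = value
    return result
```

With `data = [7, 8, 9, 1]` and `dest = [2, 2, 0, 2]`, location 0 receives 9 and location 2 receives one of the three competing values 7, 8, or 1.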

On the SIMD model, a RAW takes $43(\sqrt{N} - 1)$ electronic and $O(N^{3/8})$ OTIS
moves, while on the MIMD model it takes $26(\sqrt{N} - 1)$ electronic and $O(N^{3/8})$ OTIS
moves.

4.16 Summary

Our algorithms run faster than the simulation of the fastest algorithms known
for 4D meshes.  Tables 4.3 and 4.4 summarize the complexities of our algorithms and
those of the corresponding ones obtained by simulating the best 4D-mesh algorithms
on the SIMD and MIMD models, respectively.  Note that the worst case complexities are
listed for the broadcast and window broadcast operations, and that the complexity for
the case when $\sqrt{N}$ is even is presented for the data sum operation on the MIMD
model.  Also, the complexities listed for circular shift, data accumulation, and adjacent
sum assume that the shift distance is $< \sqrt{N}/2$ on the MIMD model.  Both tables give
only the dominating $\sqrt{N}$ terms for sorting.  Our algorithms for data broadcast, data
sum, concentrate, distribute, and generalize are optimal.

Table 4.3.  Comparison of complexities on SIMD model

                     Simulation                              Ours
Operation            Electronic          OTIS                Electronic          OTIS
Broadcast            $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     1
Window Broadcast     $4\sqrt{N}-2w-2$    $4(\sqrt{N}-1)$     $4\sqrt{N}-2w-2$    2
Prefix Sum           $7(\sqrt{N}-1)$     $6(\sqrt{N}-1)$     $7(\sqrt{N}-1)$     2
Data Sum             $8(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     1
Rank                 $7(\sqrt{N}-1)$     $6(\sqrt{N}-1)$     $7(\sqrt{N}-1)$     2
Regular Shift        $s$                 $2s$                $s$                 2
Circular Shift       $\sqrt{N}$          $2\sqrt{N}$         $\sqrt{N}$          2
Data Accumulation    $\sqrt{N}$          $2\sqrt{N}$         $\sqrt{N}$          2
Consecutive Sum      $2(M-1)$            $4(M-1)$            $2(M-1)$            2
Adjacent Sum         $\sqrt{N}$          $2\sqrt{N}$         $\sqrt{N}$          2
Concentrate          $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     $7(\sqrt{N}-1)$     2
Distribute           $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     $7(\sqrt{N}-1)$     2
Generalize           $8(\sqrt{N}-1)$     $8(\sqrt{N}-1)$     $7(\sqrt{N}-1)$     2
Sorting              $14\sqrt{N}$        $12\sqrt{N}$        $22\sqrt{N}$        $O(N^{3/8})$

Table 4.4.  Comparison of complexities on MIMD model

                     Simulation                              Ours
Operation            Electronic          OTIS                Electronic          OTIS
Broadcast            $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     1
Window Broadcast     $4\sqrt{N}-2w-2$    $4(\sqrt{N}-1)$     $4\sqrt{N}-2w-2$    2
Prefix Sum           $7(\sqrt{N}-1)$     $6(\sqrt{N}-1)$     $7(\sqrt{N}-1)$     2
Data Sum             $4\sqrt{N}$         $4\sqrt{N}$         $4\sqrt{N}$         1
Rank                 $7(\sqrt{N}-1)$     $6(\sqrt{N}-1)$     $7(\sqrt{N}-1)$     2
Regular Shift        $s$                 $2s$                $s$                 2
Circular Shift       $s$                 $2s$                $s$                 2
Data Accumulation    $M$                 $M$                 $M$                 2
Consecutive Sum      $M-1$               $2(M-1)$            $M-1$               2
Adjacent Sum         $M$                 $2M$                $M$                 2
Concentrate          $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     2
Distribute           $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     2
Generalize           $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     $4(\sqrt{N}-1)$     2
Sorting              $7\sqrt{N}$         $6\sqrt{N}$         $11\sqrt{N}$        $O(N^{3/8})$

CHAPTER 5
MATRIX MULTIPLICATIONS ON AN OTIS-MESH

In this chapter, we develop algorithms to multiply vectors of size $kN$ and
matrices of size $kN \times kN$ on an $N^2$ processor OTIS-Mesh.  These algorithms are
developed for both of the matrix to OTIS-Mesh mapping schemes considered in
Section 2.3: group row-major mapping (GRM) and group submatrix mapping (GSM).
We begin, in Section 5.1, by describing the GRM and GSM schemes and making
observations about the complexity of performing the matrix add and transpose
operations.  In Section 5.2, we develop the algorithms for various versions of vector and
matrix multiplication.

For purposes of this chapter the essential differences between electronic and
optical links are (a) optical links have much larger bandwidth than do electronic links;
and (b) transfer times including latency are different on optical and electronic links.
In our analysis, we count communication along electronic and optical interconnects
separately.  However, we use the simplifying assumption that any constant amount of
data can be communicated over an optical link during an optical communication step
while only a unit amount of data can be communicated over an electronic link during
an electronic communication step.  In this chapter, we assume that the processor mesh
that represents any group of processors is a SIMD mesh.  Therefore, in any given time
step, data can be moved in only one of the four mesh directions: up, down, left, or
right.  Extensions to MIMD meshes are straightforward and thus omitted.

5.1 Mapping Matrices Onto An OTIS-Mesh

In Section 2.3 we described the GRM and GSM mapping of a matrix.  For the
GSM mapping, we introduce the following notation.  Matrix element $(i, j)$ is mapped
to processor $(i_f, j_f, i_m, j_m)$ where $i_f = \lfloor i/\sqrt{N} \rfloor$, $i_m = i \bmod \sqrt{N}$,
$j_f = \lfloor j/\sqrt{N} \rfloor$, and $j_m = j \bmod \sqrt{N}$.
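The GSM index computation can be written out directly. The helper below is a minimal sketch; the name `gsm_processor` and its argument order are hypothetical, and $N$ is assumed to be a perfect square so that each group is a $\sqrt{N} \times \sqrt{N}$ mesh.

```python
import math


def gsm_processor(i, j, n):
    """Map element (i, j) of an n x n matrix to processor (i_f, j_f, i_m, j_m)."""
    r = math.isqrt(n)
    if r * r != n:
        raise ValueError("n must be a perfect square")
    # (i_f, j_f) selects the group; (i_m, j_m) selects the processor in the group.
    return (i // r, j // r, i % r, j % r)
```

For $N = 16$, element $(5, 14)$ has $i_f = 1$, $j_f = 3$, $i_m = 1$, and $j_m = 2$. Distinct matrix elements always map to distinct processors, so the mapping is one element per processor.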

The GRM and GSM mappings of a row or column vector are obtained from
the corresponding mapping of an $N \times N$ matrix by extracting the sub-mapping
corresponding to row zero or column zero of the matrix.  The GRM and GSM mappings of
a $kN \times kN$ matrix are obtained by partitioning the $kN \times kN$ matrix into $N \times N$ blocks
of size $k \times k$ each.  The $N \times N$ block matrix is then mapped onto the $N^2$ processor
OTIS-Mesh, one block per processor, using the standard GRM and GSM schemes
described above.  $1 \times kN$ and $kN \times 1$ vectors are mapped by using the sub-mapping
corresponding to row zero or column zero of the $kN \times kN$ matrix mapping.

It is easy to see that regardless of which mapping is used, matrix as well
as vector addition and subtraction require no interprocessor communication.  Two
$kN \times kN$ matrices can be added or subtracted in $O(k^2)$ time and two vectors of size
$kN$ can be added or subtracted in $O(k)$ time.

Algorithms for the matrix transpose operation were developed in Section 3.1.
A $kN \times kN$ matrix can be transposed using a single OTIS move and no electronic
moves when the GRM mapping is used.  When the GSM mapping is used, the
transpose requires $8k^2(\sqrt{N} - 1)$ electronic and 2 OTIS moves.  In either case, $O(k^2)$
intraprocessor moves are needed to transpose the $k \times k$ block stored in a processor.

5.2 Multiplication Algorithm
5.2.1 Column Vector x Row Vector

GRM

First consider the GRM mapping.  When an $N \times 1$ column vector $A$ and a
$1 \times N$ row vector $B$ are multiplied, the result is an $N \times N$ matrix.  This is to be
stored in the OTIS-Mesh using the GRM mapping.

Step 1: Perform an OTIS move on $B$.

Step 2: Broadcast the $A$ and $B$ data in each group to all processors of the group.

Step 3: Perform an OTIS move on $B$.

Step 4: Each processor multiplies its $A$ and $B$ data.

Figure 5.1.  GRM Column x Row Multiplication

Initially, the element $A_i$ in row $i$ of $A$ is in the 0th processor of group $i$ (i.e.,
processor $(i, 0)$) and the $j$th element $B_j$ of $B$ is in processor $(0, j)$.  Following the
multiplication, the $(i, j)$ element $A_i B_j$ of the product matrix is to be in processor
$(i, j)$.  The four step algorithm given in Figure 5.1 performs the multiplication.

Following Step 1, $B_j$ is in processor $(j, 0)$, $0 \le j < N$; and following the
broadcast of Step 2, processor $(i, *)$ has $A_i$ ($*$ denotes all permissible indexes; in this
case indexes are in the range $[0, N)$) and processor $(j, *)$ has $B_j$.  After the OTIS move
of Step 3, processor $(i, j)$ has $A_i$ and $B_j$.  Consequently, following the multiplication
of Step 4, processor $(i, j)$ has the $(i, j)$th entry of the result matrix.  Therefore, the
algorithm correctly multiplies the vectors $A$ and $B$.
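The data movement just traced can be checked with a small simulation. The sketch below is illustrative only: processors are modeled as dictionary keys $(g, p)$ (group, processor), the helper names `otis_move` and `group_broadcast` are hypothetical, and the group broadcast is performed in one step rather than through individual electronic moves.

```python
def otis_move(data):
    """OTIS move: the item in processor (g, p) goes to processor (p, g)."""
    return {(p, g): v for (g, p), v in data.items()}


def group_broadcast(data, n):
    """Copy the single value held in a group to all n processors of that group."""
    return {(g, q): v for (g, _p), v in data.items() for q in range(n)}


def grm_column_times_row(a, b):
    n = len(a)
    a_data = {(i, 0): a[i] for i in range(n)}   # A_i starts in processor (i, 0)
    b_data = {(0, j): b[j] for j in range(n)}   # B_j starts in processor (0, j)
    b_data = otis_move(b_data)                  # Step 1: B_j is now in (j, 0)
    a_data = group_broadcast(a_data, n)         # Step 2: (i, *) has A_i
    b_data = group_broadcast(b_data, n)         #         (j, *) has B_j
    b_data = otis_move(b_data)                  # Step 3: (*, j) has B_j
    # Step 4: processor (i, j) multiplies its A and B data.
    return {pos: a_data[pos] * b_data[pos] for pos in a_data}
```

Calling `grm_column_times_row([1, 2, 3, 4], [5, 6, 7, 8])` leaves $A_i B_j$ under key `(i, j)` for every pair, which matches the GRM placement of the product matrix.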

For the complexity analysis, we see that 2 OTIS moves are made in Steps 1
and 3 together.  Step 2 can be done using $2\sqrt{N}$ electronic moves by first sending
the $A$ and $B$ data initially in processor 0 of a group down column zero of that group.
Since only one piece of data can be moved at a time along an electronic link, this
column broadcast of the $A$ and $B$ data can be done in $\sqrt{N}$ moves if we pipeline the
data movement down column zero (i.e., the $B$ data trail the $A$ data by one column
processor).  Next the $A$ and $B$ data in each processor of column 0 are broadcast
along rows using a similar pipelining.  This requires another $\sqrt{N}$ electronic moves.
The complexity of our GRM column x row algorithm is therefore $2\sqrt{N}$ electronic
and 2 OTIS moves.
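The pipelining count used above can be reproduced with a short simulation of one column (or row) of processors. This is a sketch under stated assumptions: the function name `pipelined_broadcast_moves` is hypothetical, each link carries one item per move, and all links move data in the same direction in a move, as in the SIMD model.

```python
def pipelined_broadcast_moves(length, items):
    """Count moves needed to pass `items` values from processor 0 down a line."""
    held = [set() for _ in range(length)]
    held[0] = set(range(items))
    sent = [0] * length          # sent[p]: number of items p has forwarded so far
    moves = 0
    while any(len(h) < items for h in held):
        # Decide all transfers from the state at the start of the move.
        transfers = [(p, sent[p]) for p in range(length - 1)
                     if sent[p] < items and sent[p] in held[p]]
        for p, item in transfers:
            held[p + 1].add(item)
            sent[p] += 1
        moves += 1
    return moves
```

For a column of $\sqrt{N} = 4$ processors and the two items $A$ and $B$, the function returns 4, in agreement with the $\sqrt{N}$ moves claimed for the column broadcast. In this model, a line of length $L \ge 2$ carrying $m$ items needs $L + m - 2$ moves.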

Theorem 5.2.1  Our column x row algorithm is an optimal algorithm.

Proof  To see this, first note that all the $B$ values are initially in group 0, and all
need to get to group 1 (say) either in the form of $A_1 B_j$ or simply $B_j$.  The only
way data can move from one group to another is via an OTIS move, and a single
OTIS move can only move a constant number of the $B_j$'s accumulated into a single
processor of group 0 into a single processor of group 1.  Therefore, at least 2 OTIS
moves are needed.

Also, $2\sqrt{N}$ electronic moves are necessary.  To see this, observe that $B_0$ is
initially in processor $(0,0)$ and its influence must be seen at all processors $(*,0)$
because the $(*,0)$ element of the result is $A_* B_0$, which is to be left in processor $(*,0)$.
OTIS moves can only transpose group and local processor indexes.  To effect a change
from $(0,0)$ to $(*,0)$, $2\sqrt{N} - 2$ electronic moves ($\sqrt{N} - 1$ rightward row moves and
$\sqrt{N} - 1$ downward column moves) are essential.  Further, $A_0$ is initially in $(0,0)$ and
all values in $(0,*)$ depend on $A_0$.  Therefore at least $\sqrt{N} - 1$ rightward row moves and
$\sqrt{N} - 1$ downward column moves are needed to communicate the $A_0$ value directly
or indirectly to $(0,*)$.  Since only unit data can flow along an electronic link in a
single move, we cannot overlap all of the rightward row moves needed for $A_0$ and
$B_0$.  Therefore, at least $\sqrt{N}$ rightward moves must be made.  Similarly at least $\sqrt{N}$
downward moves must be made.  □

The algorithm of Figure 5.1 also can be used when $A$ is a $kN \times 1$ vector and
$B$ a $1 \times kN$ vector.  Now, in Steps 1 and 3, blocks of $k$ values are moved from a single
processor via an OTIS move.  In Step 2, blocks of size $k$ are to be broadcast.  The
strategy is the same as for the case $k = 1$; however, now we must pipeline the $2k$ $A$
and $B$ values in a processor for the column and row broadcast steps.  This pipelining
takes $2\sqrt{N} + 4k - 4$ electronic moves.  Steps 1 and 3 still take 2 OTIS moves as we can
move $k$ element blocks using a single OTIS move.  In Step 4, each processor performs
$k^2$ multiplications to generate a $k \times k$ block of the product matrix.

GSM

For $A$ an $N \times 1$ vector and $B$ a $1 \times N$ vector, we start with $A_i$ in processor
$(i_f, 0, i_m, 0)$ and $B_j$ in $(0, j_f, 0, j_m)$ and are to leave the product term $C_{ij} = A_i B_j$ in
$(i_f, j_f, i_m, j_m)$.  The algorithm of Figure 5.2 does this.

Step 1: Processors that have a $B$ value broadcast this $B$ value to all processors in
the same column of the group.

Step 2: Processors that have an $A$ value broadcast this $A$ value to all processors in
the same row of the group.

Step 3: Perform an OTIS move on the $A$ and $B$ values in a processor.

Step 4: Same as Step 1.

Step 5: Same as Step 2.

Step 6: Same as Step 3.

Step 7: Each processor multiplies its $A$ and $B$ values to produce an element of the
product matrix.

Figure 5.2.  GSM Column x Row multiply algorithm

Step 1 moves $B_j$ from $(0, j_f, 0, j_m)$ to $(0, j_f, *, j_m)$ and Step 2 moves $A_i$ from
$(i_f, 0, i_m, 0)$ to $(i_f, 0, i_m, *)$.  Following Step 3, $B_j$ is in $(*, j_m, 0, j_f)$ and $A_i$ is in
$(i_m, *, i_f, 0)$.  Step 4 now moves $B_j$ from $(*, j_m, 0, j_f)$ to $(*, j_m, *, j_f)$ and Step 5
moves $A_i$ from $(i_m, *, i_f, 0)$ to $(i_m, *, i_f, *)$.  Following Step 6, processor $(i_f, j_f, i_m, j_m)$
has $A_i$ and $B_j$.  Therefore, Step 7 correctly computes the product element $C_{ij} = A_i B_j$.

The number of data moves is $4(\sqrt{N} - 1)$ electronic and 2 OTIS moves.  The
algorithm of Figure 5.2 is optimal because the $B_0$ initially in $(0,0,0,0)$ affects the final
value $A_{N-1} B_0$ in $(\sqrt{N} - 1, 0, \sqrt{N} - 1, 0)$.  This requires $2(\sqrt{N} - 1)$ electronic column
moves.  Further, the $A_0$ initially in $(0,0,0,0)$ affects the final value $A_0 B_{N-1}$ in
$(0, \sqrt{N} - 1, 0, \sqrt{N} - 1)$.  This requires $2(\sqrt{N} - 1)$ electronic row moves.  Additionally,
$\sqrt{N}$ $A$ values initially in group $(0,0)$ affect the final values in group $(0,1)$.  This
requires at least 2 OTIS moves (assume that $\sqrt{N} > 2$).
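The index bookkeeping of the seven-step GSM column x row algorithm is easy to get wrong, so a small simulation may help. In this sketch processors are keyed by $(g_x, g_y, p_x, p_y)$, an OTIS move exchanges the group and processor index pairs, and each broadcast is done in a single step; the function names are hypothetical and electronic moves are not counted.

```python
import math


def otis_move(data):
    """OTIS move: (gx, gy, px, py) exchanges data with (px, py, gx, gy)."""
    return {(px, py, gx, gy): v for (gx, gy, px, py), v in data.items()}


def column_broadcast(data, r):
    """Copy each value to every row of its column within the same group."""
    return {(gx, gy, x, py): v
            for (gx, gy, _px, py), v in data.items() for x in range(r)}


def row_broadcast(data, r):
    """Copy each value to every column of its row within the same group."""
    return {(gx, gy, px, y): v
            for (gx, gy, px, _py), v in data.items() for y in range(r)}


def gsm_column_times_row(a, b):
    n = len(a)
    r = math.isqrt(n)
    if r * r != n:
        raise ValueError("vector length must be a perfect square")
    a_data = {(i // r, 0, i % r, 0): a[i] for i in range(n)}   # A_i in (i_f, 0, i_m, 0)
    b_data = {(0, j // r, 0, j % r): b[j] for j in range(n)}   # B_j in (0, j_f, 0, j_m)
    b_data = column_broadcast(b_data, r)                       # Step 1
    a_data = row_broadcast(a_data, r)                          # Step 2
    a_data, b_data = otis_move(a_data), otis_move(b_data)      # Step 3
    b_data = column_broadcast(b_data, r)                       # Step 4
    a_data = row_broadcast(a_data, r)                          # Step 5
    a_data, b_data = otis_move(a_data), otis_move(b_data)      # Step 6
    return {pos: a_data[pos] * b_data[pos] for pos in a_data}  # Step 7
```

For $N = 16$ the result holds $A_i B_j$ under key $(\lfloor i/4 \rfloor, \lfloor j/4 \rfloor, i \bmod 4, j \bmod 4)$ for all $i$ and $j$, as required by the GSM mapping.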

The algorithm of Figure 5.2 can also be used when $k > 1$.  Now, the broadcasts
and each OTIS move involve 2 blocks of $k$ elements each.  The broadcasts are done
by pipelining the transfer of the $k$ elements in a block and each OTIS move simply
does a block transfer of the $k$ elements.  The total number of data move steps becomes
$4\sqrt{N} + 4k - 8$ electronic and 2 OTIS.  Step 7 produces a $k \times k$ block of the result
matrix using $k^2$ multiplication steps.

5.2.2 Row Vector x Column Vector

GRM

For a $1 \times N$ row vector $A$ and an $N \times 1$ column vector $B$, we begin with $A_i$
in $(0, i)$ and $B_i$ in $(i, 0)$.  The result $\sum_{i=0}^{N-1} A_i B_i$ is to be left in $(0,0)$.  In the algorithm
of Figure 5.3, $B_i$ is moved from $(i, 0)$ to $(0, i)$ in Step 1.  The sum $\sum_{i=0}^{N-1} A_i B_i$ is
computed in Step 3 by first moving the products of Step 2 upward to row 0 and
adding terms in the row zero processors.  Then the partial sums are moved leftward
along row zero and the result computed in $(0,0)$.  The algorithm requires 1 OTIS and
$2(\sqrt{N} - 1)$ electronic moves.  It is obvious that the algorithm is optimal.
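The three steps described above can be simulated as follows. The sketch assumes, purely for illustration, that processor $p$ of a group sits at row $\lfloor p/\sqrt{N} \rfloor$ and column $p \bmod \sqrt{N}$ of the group's mesh; the function names are hypothetical. It returns the inner product together with the number of electronic shift steps used in Step 3.

```python
import math


def otis_move(data):
    """OTIS move: the item in processor (g, p) goes to processor (p, g)."""
    return {(p, g): v for (g, p), v in data.items()}


def grm_row_times_column(a, b):
    n = len(a)
    r = math.isqrt(n)
    if r * r != n:
        raise ValueError("vector length must be a perfect square")
    a_data = {(0, i): a[i] for i in range(n)}                   # A_i in (0, i)
    b_data = otis_move({(i, 0): b[i] for i in range(n)})        # Step 1: B_i to (0, i)
    prod = {pos: a_data[pos] * b_data[pos] for pos in a_data}   # Step 2
    # Step 3: view group 0 as an r x r mesh (row-major layout assumed here).
    grid = [[prod[(0, row * r + col)] for col in range(r)] for row in range(r)]
    moves = 0
    for row in range(r - 1, 0, -1):          # shift partial sums upward to row 0
        for col in range(r):
            grid[row - 1][col] += grid[row][col]
        moves += 1
    for col in range(r - 1, 0, -1):          # shift partial sums leftward along row 0
        grid[0][col - 1] += grid[0][col]
        moves += 1
    return grid[0][0], moves
```

For vectors of length 16 the function reports 6 electronic moves, which equals $2(\sqrt{N} - 1)$.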

When the vectors are of size $1 \times kN$ and $kN \times 1$, respectively, Step 2 multiplies
a $1 \times k$ block of $A$ with a $k \times 1$ block of $B$.  This takes $O(k)$ time.  We assume that
the cost of $O(k)$ arithmetic operations is considerably less than the cost of an electronic
move and, therefore, make no attempt to utilize processors from other groups to reduce
the time spent on arithmetic operations.  The data moves required by the algorithm
of Figure 5.3 still are $2(\sqrt{N} - 1)$ electronic and 1 OTIS.

Step 1: Perform an OTIS move on $B$ values.

Step 2: Each processor of group 0 multiplies its $A$ and $B$ values.

Step 3: Sum the products of Step 2 by columns and finally along row zero, leaving
the result in $(0,0)$.

Figure 5.3.  GRM Row x Column Multiply

GSM

When multiplying a $1 \times N$ vector and an $N \times 1$ vector using the GSM mapping,
the algorithm of Figure 5.4 can be used.

Step 1: Groups with $A$ values move their $A$ values from row 0 to column 0 using the
data paths of Figure 5.5.

Step 2: Perform an OTIS move on all data.

Step 3: Shift the $A$ values leftward along row 0 of a group and the $B$ values upward
along column 0 and compute the sum of products in the $(0,0)$ processor of each
group that has $A$ and $B$ values.

Step 4: Perform an OTIS move on the product sums computed in Step 3.

Step 5: Shift the product sums upward along column 0 of group 0, summing these
sums in processor $(0,0)$.

Figure 5.4.  GSM Row x Column Multiply

The algorithm of Figure 5.4 begins with $A_i$ in $(0, i_f, 0, i_m)$ and $B_i$ in $(i_f, 0, i_m, 0)$.
In Step 1, $A_i$ is moved to $(0, i_f, i_m, 0)$ by performing $\sqrt{N} - 1$ downward moves and
$\sqrt{N} - 1$ leftward moves as in Figure 5.5.  Following Step 2, $A_i$ is in $(i_m, 0, 0, i_f)$ and
$B_i$ is in $(i_m, 0, i_f, 0)$.  In Step 3, $(i_m, 0, 0, 0)$ sums up $\sqrt{N}$ of the terms that contribute
to the result.  In Step 4, these sums are moved to $(0, 0, i_m, 0)$ and are added together
in Step 5.  Steps 1 and 3 take $2(\sqrt{N} - 1)$ electronic moves each and Step 5 takes
$\sqrt{N} - 1$ electronic moves.  The total number of data moves is therefore $5(\sqrt{N} - 1)$
electronic and 2 OTIS moves.

Figure 5.5.  Data Paths Used in Step 1 of Figure 5.4

A straightforward generalization of the algorithm of Figure 5.4 to the case
when we are multiplying a $1 \times kN$ row with a $kN \times 1$ column results in excessive
complexity when $k > 1$.  This is so because the pipelining of Step 3 takes $2k(\sqrt{N} - 1)$
electronic moves.  When $k > 1$, the number of data moves is reduced by using the
algorithm of Figure 5.6.

Step 1: Groups with $A$ values move their $A$ value blocks from row 0 to column 0
using pipelining and the data paths of Figure 5.5.

Step 2: Perform an OTIS move on all data blocks.

Step 3: Same as Step 1.

Step 4: Processors with an $A$ and a $B$ block multiply their blocks (these are the
column 0 processors of each column 0 group).

Step 5: The column 0 processors, in each column 0 group, shift their Step 4 results
upward along column 0.  The results are added together by the $(0,0)$ processor
in each group.

Step 6: Perform an OTIS move on the sums computed by the $(0,0)$ processors in Step
5.

Step 7: In group $(0,0)$, the column 0 processors shift the values received in Step 6
upward to the $(0,0)$ processor of the group.  The $(0,0)$ processor adds these
values together.

Figure 5.6.  GSM Row x Column Multiply for $k > 1$

The algorithm of Figure 5.6 begins with the $i$th block of $A$ in $(0, i_f, 0, i_m)$ and
the $i$th block of $B$ in $(i_f, 0, i_m, 0)$.  In Step 1, the $i$th block of $A$ is moved to $(0, i_f, i_m, 0)$.
And following Step 2, the $i$th block of $A$ is in $(i_m, 0, 0, i_f)$ while the $i$th block of $B$
is in $(i_m, 0, i_f, 0)$.  Step 3 moves the $i$th block of $A$ from $(i_m, 0, 0, i_f)$ to $(i_m, 0, i_f, 0)$.
Now $(i_m, 0, i_f, 0)$ contains block $i$ of $A$ and $B$.  These blocks are multiplied in Step
4 to produce a single number in $(i_m, 0, i_f, 0)$.  In Step 5, the numbers computed in
group $(i_m, 0)$ are summed in processor $(i_m, 0, 0, 0)$.  The OTIS move of Step 6 moves
the resultant sums to $(0, 0, i_m, 0)$.  These resultant sums are added together in Step
7.

The number of data moves performed by the algorithm of Figure 5.6 is
$6\sqrt{N} + 4k - 10$ electronic and 2 OTIS.

5.2.3 Row Vector x Matrix

GRM

We are to multiply a $1 \times N$ row vector $A$ and an $N \times N$ matrix $B$.  The result
is a $1 \times N$ vector $C$ such that $C_i = \sum_{j=0}^{N-1} A_j B_{ji}$.  Initially, $A_i$ is in $(0, i)$ and $B_{ij}$ is in
$(i, j)$ and the result is to be left so that $C_i$ is in $(0, i)$.  The multiplication algorithm
is given in Figure 5.7.

Step 1: Perform an OTIS move on $A$ values.

Step 2: Processor 0 of each group broadcasts its $A$ value to the remaining processors
in its group.

Step 3: All processors multiply their $A$ and $B$ values.

Step 4: Perform an OTIS move on the products computed in Step 3.

Step 5: Processor 0 of each group sums the products from all processors in the same
group.

Step 6: Perform an OTIS move on the sums computed in Step 5.

Figure 5.7.  GRM Row Vector x Matrix Multiply

In Step 1, $A_i$ is moved from $(0, i)$ to $(i, 0)$.  Following Step 2, processor $(j, i)$
has $A_j$ and $B_{ji}$.  Processor $(j, i)$ computes $A_j B_{ji}$ in Step 3 and in Step 4, $A_j B_{ji}$ is
moved to processor $(i, j)$.  Processor $(i, 0)$ computes $C_i = \sum_{j=0}^{N-1} A_j B_{ji}$ in Step 5.  In
Step 6, $C_i$ is moved from $(i, 0)$ to $(0, i)$.  The complexity of the algorithm is $4(\sqrt{N} - 1)$
electronic and 3 OTIS moves.
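The six steps of the GRM row vector x matrix algorithm can be traced with the same dictionary model, in which key $(g, p)$ stands for processor $p$ of group $g$. The sketch below is illustrative; its function names are hypothetical, and the broadcast of Step 2 and the summation of Step 5 are each done in one step instead of through electronic moves.

```python
def otis_move(data):
    """OTIS move: the item in processor (g, p) goes to processor (p, g)."""
    return {(p, g): v for (g, p), v in data.items()}


def grm_row_vector_times_matrix(a, b):
    n = len(a)
    a_data = otis_move({(0, i): a[i] for i in range(n)})        # Step 1: A_i to (i, 0)
    a_data = {(g, q): v for (g, _p), v in a_data.items()
              for q in range(n)}                                # Step 2: (j, *) has A_j
    prod = {(j, i): a_data[(j, i)] * b[j][i]
            for j in range(n) for i in range(n)}                # Step 3: A_j * B_ji
    prod = otis_move(prod)                                      # Step 4: to (i, j)
    sums = {(i, 0): sum(prod[(i, j)] for j in range(n))
            for i in range(n)}                                  # Step 5: C_i in (i, 0)
    c = otis_move(sums)                                         # Step 6: C_i to (0, i)
    return [c[(0, i)] for i in range(n)]
```

The returned list agrees with the direct evaluation of $C_i = \sum_j A_j B_{ji}$ for any input of matching sizes.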

When $A$ is a $1 \times kN$ vector and $B$ a $kN \times kN$ matrix, a block of $k$ $A$ values
is moved in Step 1 of Figure 5.7; the broadcast of the $A$ block in Step 2 is done
in $2(\sqrt{N} + k - 2)$ electronic moves by pipelining the broadcast of the $k$ values; the
multiplication of Step 3 is between a $1 \times k$ vector and a $k \times k$ matrix; and the OTIS
move of Step 4 moves $1 \times k$ blocks.  To do the sum of Step 5, we first sum along
rows.  This is done in $\sqrt{N} + k - 2$ electronic moves by pipelining the $k$ sums to be
computed.  Next the partial sums in column 0 are summed, again using pipelining.
Step 5 takes $2(\sqrt{N} + k - 2)$ steps.  Adding in the OTIS move of Step 6, the total
number of moves becomes $4\sqrt{N} + 4k - 8$ electronic and 3 OTIS.

GSM

Our GSM algorithm to multiply a $1 \times N$ row vector $A$ and an $N \times N$ matrix
$B$ is given in Figure 5.8.  Note that the algorithm begins with $A_j$ in $(0, j_f, 0, j_m)$ and
$B_{ji}$ in $(j_f, i_f, j_m, i_m)$.

Step 1: In each group move the $A$ values from row 0 to column 0 using the data
paths of Figure 5.5.

Step 2: The column 0 processors broadcast their $A$ values to all processors in the
same group and on the same row.

Step 3: Perform an OTIS move on all $A$ and $B$ values.

Step 4: Same as Step 1.

Step 5: Same as Step 2.

Step 6: All processors multiply their $A$ and $B$ values.

Step 7: The processors in row 0 of each group sum the products of Step 6 that are in
the same column.

Step 8: Perform an OTIS move on the sums of Step 7.

Step 9: The processors in row 0 of group $(0, *)$ sum the values received in Step 8 that
are in the same column.

Figure 5.8.  GSM Row Vector x Matrix Multiply

Step 1 moves $A_j$ from $(0, j_f, 0, j_m)$ to $(0, j_f, j_m, 0)$ and following Step 2, $A_j$ is
in $(0, j_f, j_m, *)$.  Following Step 3, $A_j$ is in $(j_m, *, 0, j_f)$ and $B_{ji}$ is in $(j_m, i_m, j_f, i_f)$.
After Step 5, $A_j$ is in $(j_m, *, j_f, *)$.  Therefore, in Step 6, processor $(j_m, i_m, j_f, i_f)$
computes $A_j B_{ji}$.  In Step 7, processor $(j_m, i_m, 0, i_f)$ computes
$\sum_{j : j \bmod \sqrt{N} = j_m} A_j B_{ji}$, which is then sent to $(0, i_f, j_m, i_m)$ in Step 8.  Finally in
Step 9 $(0, i_f, 0, i_m)$ computes $C_i = \sum_{j=0}^{N-1} A_j B_{ji}$.

Steps 1 and 4 take $2(\sqrt{N} - 1)$ electronic moves each; Steps 2, 5, 7, and 9 take
$\sqrt{N} - 1$ electronic moves each; and Steps 3 and 8 take 1 OTIS move each.  The total
number of moves is $8(\sqrt{N} - 1)$ electronic and 2 OTIS.

When $A$ is a $1 \times kN$ vector and $B$ a $kN \times kN$ matrix, Steps 1 and 4 can
be done with $2(\sqrt{N} + k - 2)$ electronic moves each by transmitting the $k$ values in
each processor in a pipelined fashion; Steps 2 and 5 take $\sqrt{N} + k - 2$ electronic moves
each (again using pipelining); Steps 7 and 9 can be done in $\sqrt{N} + k - 2$ moves each
using the pipelined summing scheme used in Step 5 of Figure 5.7.  The total number
of moves is $8(\sqrt{N} + k - 2)$ electronic and 2 OTIS.

5.2.4 Matrix x Column Vector

GRM

We start with an $N \times N$ matrix $A$ and an $N \times 1$ column vector $B$ and compute
the $N \times 1$ column vector $C$ such that $C_i = \sum_{j=0}^{N-1} A_{ij} B_j$.  Initially, $A_{ij}$ is in $(i, j)$ and
$B_j$ is in $(j, 0)$.  On termination, $C_i$ is to be in $(i, 0)$.  Our algorithm to perform the
multiplication is given in Figure 5.9.

Step 1: Processor 0 of each group broadcasts its $B$ value to all processors in its group.

Step 2: Perform an OTIS move on $B$ values.

Step 3: All processors multiply their $A$ and $B$ values.

Step 4: Processor 0 of each group sums the products computed in Step 3 by all
processors in its group.

Figure 5.9.  GRM Matrix x Column Vector Multiply

Following Step 1, $B_j$ is in $(j, *)$ and following Step 2 it is in $(*, j)$.  In Step 3,
$(i, j)$ computes $A_{ij} B_j$ and in Step 4, $(i, 0)$ computes $C_i = \sum_{j=0}^{N-1} A_{ij} B_j$.  The number
of data moves is $4(\sqrt{N} - 1)$ electronic (Steps 1 and 4 each require $2(\sqrt{N} - 1)$ electronic
moves) and 1 OTIS.
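A short simulation of the four steps, again keyed by $(g, p)$ for processor $p$ of group $g$, is sketched below. The function names are hypothetical, and the broadcast of Step 1 and the summation of Step 4 are each collapsed into a single operation rather than modeled as electronic moves.

```python
def otis_move(data):
    """OTIS move: the item in processor (g, p) goes to processor (p, g)."""
    return {(p, g): v for (g, p), v in data.items()}


def grm_matrix_times_column(a, b):
    n = len(b)
    b_data = {(j, 0): b[j] for j in range(n)}                   # B_j in (j, 0)
    b_data = {(g, q): v for (g, _p), v in b_data.items()
              for q in range(n)}                                # Step 1: (j, *) has B_j
    b_data = otis_move(b_data)                                  # Step 2: (*, j) has B_j
    prod = {(i, j): a[i][j] * b_data[(i, j)]
            for i in range(n) for j in range(n)}                # Step 3: A_ij * B_j
    # Step 4: processor 0 of group i sums the products of its group, giving C_i.
    return [sum(prod[(i, j)] for j in range(n)) for i in range(n)]
```

Only one OTIS move appears in the sketch, in agreement with the count given above.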

Theorem 5.2.2  The GRM matrix x column vector multiplication algorithm of
Figure 5.9 is optimal.

Proof  Since the value of $C_0$ depends on all $A_{0j}$ values, information about all these
$A$ values must get to $(0,0)$ either directly or indirectly.  For this to happen, at least
$\sqrt{N} - 1$ leftward row moves and $\sqrt{N} - 1$ upward column moves must be made.  Let
the snake-like row-major index of the bottom right processor of a group be $q$.  Since
$C_q = \sum_j A_{qj} B_j$, information originally in $(0,0)$ (i.e., $B_0$) must get to $(q, 0)$ directly or
indirectly.  This requires a minimum of $\sqrt{N} - 1$ rightward row moves and $\sqrt{N} - 1$
downward column moves plus one OTIS move.  The row and column moves required
for the computation of $C_0$ and $C_q$ are in opposite directions and cannot be overlapped
in the SIMD model.  Therefore, at least $4(\sqrt{N} - 1)$ electronic and 1 OTIS moves are
needed.  □

When $A$ is a $kN \times kN$ matrix and $B$ a $kN \times 1$ vector, we use the algorithm
of Figure 5.9 and pipelining as used for the case when $A$ is a $1 \times kN$ vector and $B$ a
$kN \times kN$ matrix.  The number of moves is $4(\sqrt{N} + k - 2)$ electronic and 1 OTIS.

GSM

The GSM matrix x vector multiplication algorithm is very similar to the GSM
vector x matrix algorithm of Figure 5.8.  The steps are given in Figure 5.10.  Note
that we start with $A_{ij}$ in $(i_f, j_f, i_m, j_m)$ and $B_i$ in $(i_f, 0, i_m, 0)$.

Step 1: In each group move the $B$ values from column 0 to row 0 using the data
paths of Figure 5.5 in the reverse direction.

Step 2: The row 0 processors of each group broadcast their $B$ values to all processors
in the same group and on the same column.

Step 3: Perform an OTIS move on all $A$ and $B$ values.

Step 4: Same as Step 1.

Step 5: Same as Step 2.

Step 6: All processors multiply their $A$ and $B$ values.

Step 7: The processors in column 0 of each group sum the products of Step 6 that are
in the same row.

Step 8: Perform an OTIS move on the sums of Step 7.

Step 9: The processors in column 0 of group $(*, 0)$ sum the values received in Step 8
that are in the same row.

Figure 5.10.  GSM Matrix x Column Vector Multiply

In Step 1, $B_j$ is moved from $(j_f, 0, j_m, 0)$ to $(j_f, 0, 0, j_m)$.  Following Step 2,
$B_j$ is in $(j_f, 0, *, j_m)$.  The OTIS move of Step 3 moves $B_j$ to $(*, j_m, j_f, 0)$ and $A_{ij}$ to
$(i_m, j_m, i_f, j_f)$.  Steps 4 and 5 first move $B_j$ to $(*, j_m, 0, j_f)$ and then to $(*, j_m, *, j_f)$.
Following Step 5, $(i_m, j_m, i_f, j_f)$ has $A_{ij}$ and $B_j$.  In Step 6, $(i_m, j_m, i_f, j_f)$ computes
$A_{ij} B_j$.  In Step 7, processor $(i_m, j_m, i_f, 0)$ computes $\sum_{j : j \bmod \sqrt{N} = j_m} A_{ij} B_j$, which
is then sent to $(i_f, 0, i_m, j_m)$ in Step 8.  Finally, in Step 9 $(i_f, 0, i_m, 0)$ computes
$\sum_{j=0}^{N-1} A_{ij} B_j$.  The total number of data moves is $8(\sqrt{N} - 1)$ electronic and 2 OTIS.

In the case when $A$ is a $kN \times kN$ matrix and $B$ a $kN \times 1$ vector, we use
the algorithm of Figure 5.10 and pipelining as used for the case when $A$ is a $1 \times kN$
vector and $B$ a $kN \times kN$ matrix.  The number of moves is $8(\sqrt{N} + k - 2)$ electronic
and 2 OTIS.

5.2.5 Matrix x Matrix

O(N) Memory/Processor Algorithms

When each processor has $O(N)$ memory, it is possible to accumulate an entire
column (or row) into each processor.  This leads to simplified algorithms.  Consider
the case when we are to multiply two $N \times N$ matrices $A$ and $B$.

GRM    The GRM algorithm is given in Figure 5.11.

Step 1: Perform an OTIS move on $B$.

Step 2: Each processor of each group accumulates all $A$ and $B$ values in its group.

Step 3: Move the accumulated $B$ values along the OTIS connection.

Step 4: Each processor computes the inner product of the $A$ and $B$ values it has.

Figure 5.11.  O(N) Memory GRM Matrix x Matrix Multiply

We begin with $A_{ij}$ and $B_{ij}$ in $(i, j)$.  Following Step 1, $(i, j)$ has $A_{ij}$ and $B_{ji}$.
After Step 2, $(i, j)$ has row $i$ of $A$ and column $i$ of $B$.  Following Step 3, $(i, j)$ has row
$i$ of $A$ and column $j$ of $B$.  In Step 4, $(i, j)$ computes $C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj}$.
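The four steps of the $O(N)$ memory GRM algorithm can be checked with a short simulation in which key $(g, p)$ denotes processor $p$ of group $g$. This is a sketch only: the function names are hypothetical, the accumulation of Step 2 is done in one operation per processor, and neither electronic moves nor optical bandwidth is modeled.

```python
def otis_move(data):
    """OTIS move: the item in processor (g, p) goes to processor (p, g)."""
    return {(p, g): v for (g, p), v in data.items()}


def grm_matrix_multiply(a, b):
    n = len(a)
    cells = [(i, j) for i in range(n) for j in range(n)]
    a_data = {(i, j): a[i][j] for (i, j) in cells}
    b_data = otis_move({(i, j): b[i][j] for (i, j) in cells})   # Step 1: (i, j) has B_ji
    # Step 2: each processor accumulates all A and B values held in its group.
    a_rows = {(i, j): [a_data[(i, k)] for k in range(n)] for (i, j) in cells}
    b_cols = {(i, j): [b_data[(i, k)] for k in range(n)] for (i, j) in cells}
    b_cols = otis_move(b_cols)                                  # Step 3: column j of B
    # Step 4: inner product of row i of A with column j of B.
    return [[sum(x * y for x, y in zip(a_rows[(i, j)], b_cols[(i, j)]))
             for j in range(n)] for i in range(n)]
```

The result equals the ordinary matrix product for any pair of square matrices of the same size.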

Step 2 can be done in two stages. In the first stage, the B values are accumulated; and in the second stage the A values are accumulated. To accumulate the B values, each processor first accumulates all values from its row. This takes √N - 1 rightward and √N - 1 leftward moves. Next, the accumulated blocks of √N values are accumulated along columns by making √N - 1 upward and √N - 1 downward moves of

Step 1: Perform an OTIS move on A and B values.

Step 2: Each processor accumulates all √N A values in the same row and group as well as all √N B values in the same column and group.

Step 3: Move the accumulated A and B values along the OTIS connection.

Step 4: Each processor accumulates all N A values in the same row and group as well as all N B values in the same row and group.

Step 5: Each processor computes the inner product of the A and B values it has.

Figure 5.12. O(N) Memory GSM Matrix x Matrix Multiply

blocks of size √N. The total stage 1 moves are 2√N(√N - 1) + 2(√N - 1) = 2(N - 1). Stage 2 is done similarly. Step 3 takes N/K OTIS moves, where K is the maximum number of B values that can be moved in unit time over an optical link. The total number of moves needed by the algorithm of Figure 5.11 is 4(N - 1) electronic and N/K + 1 OTIS. Each processor needs memory for N A values and N B values. The memory requirements can be reduced to N + √N by delaying stage 2 of Step 2 to after Step 3 and coupling Step 4 with the columnwise movement of the √N size packets of A during stage 2.

The algorithm of Figure 5.11 is easily generalized to the case when A and B are kN x kN matrices. Operations previously performed on matrix elements are now performed on k x k blocks of elements. The data movement counts are 1 OTIS in Step 1, 4k^2(N - 1) electronic in Step 2, and k^2 N/K OTIS in Step 3. The total is 4k^2(N - 1) electronic and k^2 N/K + 1 OTIS.
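The data flow of Figure 5.11 can be traced with a small simulation. The sketch below is illustrative only: it models the OTIS-Mesh as a two-level array indexed by (group, processor), treats an OTIS move as a swap of those two indices, and ignores the electronic moves inside a group (Step 2 is modeled as an instantaneous gather). The function names `otis_move` and `grm_multiply` are assumptions made for this example and do not come from the text.

```python
def otis_move(grid):
    """OTIS move: data held by (group g, processor p) goes to (group p, processor g)."""
    n = len(grid)
    return [[grid[p][g] for p in range(n)] for g in range(n)]


def grm_multiply(A, B):
    """Trace Steps 1-4 of the O(N) memory GRM algorithm for n x n matrices."""
    n = len(A)
    # Initial GRM layout: processor (i, j) holds A[i][j] and B[i][j].
    a = [list(row) for row in A]
    b = [list(row) for row in B]
    # Step 1: OTIS move on B, so processor (i, j) now holds B[j][i].
    b = otis_move(b)
    # Step 2: every processor of group i gathers all A and B values of its group,
    # i.e. row i of A and column i of B.
    a_acc = [[list(a[i]) for _ in range(n)] for i in range(n)]
    b_acc = [[list(b[i]) for _ in range(n)] for i in range(n)]
    # Step 3: OTIS move on the accumulated B values; (i, j) now has column j of B.
    b_acc = otis_move(b_acc)
    # Step 4: each processor computes the inner product of what it holds.
    return [[sum(x * y for x, y in zip(a_acc[i][j], b_acc[i][j]))
             for j in range(n)] for i in range(n)]
```

For example, `grm_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns the ordinary matrix product of the two inputs.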

GSM  Our GSM algorithm to multiply two N x N matrices is given in Figure 5.12.

Following Step 1, A_ij and B_ij are in (i_m, j_m, i_l, j_l). Following Step 2, (i_m, j_m, i_l, *) has the √N A values A_iq such that q_m = j_m and (i_m, j_m, *, j_l) has the √N B values B_rj such that r_m = i_m. These √N blocks of A and B values are then moved to (i_l, *, i_m, j_m) and (*, j_l, i_m, j_m), respectively. Following Step 4, (i_l, *, i_m, *) has row i of A and (*, j_l, *, j_m) has column j of B. Therefore, (i_l, j_l, i_m, j_m) has row i of A and column j of B. The inner product computation of Step 5 leaves C_ij = Σ_{k=0}^{N-1} A_ik B_kj in (i_l, j_l, i_m, j_m).

Step 1 takes 1 OTIS move. Step 2 takes 2(√N - 1) electronic row moves to accumulate the A values and 2(√N - 1) electronic column moves to accumulate the B values. Step 3 takes 2√N/K OTIS moves. In Step 4, each electronic move moves √N data. Since a total of 4(√N - 1) moves are made, the total cost is 4√N(√N - 1) unit electronic moves. Hence the total number of moves is 4(N - 1) electronic and 2√N/K + 1 OTIS.

The algorithm is easily extended to the case when A and B are kN x kN matrices. The number of moves is 4k^2(N - 1) electronic and 2k^2 √N/K + 1 OTIS.

O(1) Memory/Processor Algorithms

Our O(1) memory algorithm is based on Cannon's algorithm [2] to multiply two N x N matrices on an N x N mesh connected computer. Cannon's algorithm was also used by Dekel, Nassimi, and Sahni [6] in their development of hypercube algorithms for matrix multiplication. Cannon's algorithm is given in Figure 5.13.

GRM  We simulate Cannon's algorithm on the OTIS-Mesh. While obtaining the alignment of Step 1, we also obtain the reverse of each aligned row of A and each aligned column of B. The process for B is similar to that used for A except that we must precede and follow the algorithm for A by an OTIS move (the preceding OTIS

Step 1: [Align Matrix Elements] Move A_{i,(i+j) mod N} and B_{(i+j) mod N,j} to mesh processor (i, j).

Step 2: [Initialize C_ij] Processor (i, j) initializes its C value to the product of its A and B values.

Step  3:  [Compute  and  Add  Remaining  Terms] 

Repeat  N  -  1  times: 

{  Shift  A  values  left  circularly  by  1; 

Shift  B  values  up  circularly  by  1; 

C  =  C  +  A  *  B;  } 


Figure  5.13.  Cannon's  Matrix  Multiplication  Algorithm 
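The three steps of Figure 5.13 can be written out directly on an n x n array of processors. The following is a minimal sequential sketch, assuming plain Python lists stand in for the mesh and that each list comprehension models one synchronous step of all processors; the name `cannon_multiply` is chosen for this example only.

```python
def cannon_multiply(A, B):
    """Cannon's algorithm on an n x n mesh, simulated sequentially."""
    n = len(A)
    # Step 1: alignment. Processor (i, j) receives A[i][(i+j) mod n]
    # and B[(i+j) mod n][j].
    a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    # Step 2: initialize C to the product of the local A and B values.
    c = [[a[i][j] * b[i][j] for j in range(n)] for i in range(n)]
    # Step 3: n - 1 rounds of circular shifts followed by multiply-accumulate.
    for _ in range(n - 1):
        # Shift A left circularly by 1: (i, j) takes the value of (i, j+1).
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        # Shift B up circularly by 1: (i, j) takes the value of (i+1, j).
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
    return c
```

After t shifts, processor (i, j) holds A[i][(i+j+t) mod n] and B[(i+j+t) mod n][j], so over the n rounds every term of the inner product is visited exactly once.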

Step  1:  In  each  group,  move  A  values  upward  along  columns.  As  data  moves  through 
a  processor,  the  processor  saves  a  copy  in  case  it  is  needed  in  the  row  the 
processor  is  in. 

Step  2:  Same  as  Step  1  except  that  A  values  are  moved  downward. 
Step  3:  In  each  row  of  each  group,  form  the  forward  ordering. 
Step  4:  In  each  row  of  each  group,  form  the  reverse  ordering. 

Figure  5.14.  Moving  A  Values  as  per  Step  1  of  Cannon's  Algorithm 

move gets all B elements in the same column into the same group and the following OTIS move gets the columns to the proper processors). We describe the alignment of rows of A only (Figure 5.14).

Steps 1 and 2 each take √N - 1 electronic moves. Following Step 2, a processor can have up to four A values: 2 belonging to the aligned ordering of Step 1 of Cannon's algorithm, and 2 belonging to the reverse of this ordering. Each row contains a total of 2√N values with each processor in the row having 0, 1, 2, 3, or 4 values. These values can be moved to the proper processors on the same row by making O(√N) leftward and rightward data moves. To align B takes O(√N) electronic and 2 OTIS moves. Therefore, making O(√N) electronic and 2 OTIS moves, we can obtain the Step 1 alignment of Cannon's algorithm as well as the reverse of this alignment.

The  circular  shift  of  A  in  Step  3  of  Cannon's  algorithm  can  be  implemented  as 
a  forward  shift  along  the  snake  of  the  reverse  alignment  and  a  backward  shift  along 
the  snake  of  the  aligned  data.  So  each  circular  shift  takes  4  electronic  moves. 

To do the circular shift on B, we retain a copy of the aligned and reversed B in each group prior to the second OTIS move done in Step 1. For each circular shift, we make 4 electronic moves in each group on the copy of B and then do an OTIS move to get the shifted B values to the desired processors. The total number of moves required by Step 3 is (N - 1) x (8 electronic and 1 OTIS) = 8(N - 1) electronic and N - 1 OTIS. Therefore, the GRM simulation of Cannon's algorithm can be done using 8N + O(√N) electronic and N + 1 OTIS moves.

The simulation just described works even when A and B are kN x kN matrices. Now, each element that is moved is a k x k block. Therefore an electronic block move takes k^2 electronic move steps. The number of moves becomes 8k^2 N + O(k^2 √N) electronic and N + 1 OTIS.

GSM  To multiply two N x N matrices we use a two level simulation of Cannon's algorithm. At the top level, we view each N x N matrix as a √N x √N matrix in which each element is a √N x √N submatrix. Let A and B be the N x N matrices to be multiplied and let BA and BB be the corresponding √N x √N matrices in which each element is a √N x √N block or submatrix of A and B, respectively. Initially, BA_ij and BB_ij are in group (i, j) of the OTIS-Mesh. Let C = A x B and let BC be the corresponding √N x √N matrix of blocks of size √N x √N each. Since BC_ij = Σ_{k=0}^{√N-1} BA_ik x BB_kj, we can use Cannon's algorithm to compute BC. The products of Steps 2 and 3 now are products of submatrices or blocks of size √N x √N; each block is in an OTIS group, which is a √N x √N mesh. These submatrix products can in turn be done using Cannon's algorithm (this is the second level application of Cannon's algorithm).

To  implement  the  two  level  scheme,  we  use  the  algorithm  of  Figure  5.15. 

Steps 1 and 2 do the data alignment necessary to perform Steps 2 and 3 of Cannon's algorithm to multiply two blocks/submatrices of size √N x √N each. The forward and backward ordering of the A values can be obtained by making √N - 1 leftward and √N - 1 rightward moves of A values. Similarly, the forward and backward ordering of B values can be done using 2(√N - 1) column moves. Following Step 2, each processor has 2 A values (one from the forward ordering and the other from the backward ordering) and 2 B values.

In Steps 3 and 4 the √N x √N blocks of submatrices of A and B are aligned. For this, an OTIS move is made on the A and B values, followed by 2(√N - 1) electronic row moves and 2(√N - 1) electronic column moves, and finally an OTIS move. For the final OTIS move, we leave a copy of the As and Bs in the originating processors also. Now each processor has 8 A and 8 B values.

Step 5 is done using Steps 2 and 3 of Cannon's algorithm at a cost of 4(√N - 1) electronic moves. In Step 6, the A and B blocks are shifted by using the copies saved during the second OTIS moves of Steps 3 and 4 followed by an OTIS move. This shifting of A and B blocks takes 2 row electronic moves (both forward and backward A blocks are to be shifted in the opposite direction) plus 2 column electronic moves

Step 1: [Align A data within each group/block] Reorder A values in each row of each group so that the A value originally in (*, *, i, (i + j) mod √N) is now in (*, *, i, j). Call this the forward A ordering. Also create the reverse of this row ordering in each group. Call this the backward ordering.

Step 2: [Align B data within each group/block] Reorder B values in each column of each group so that the B value originally in (*, *, (i + j) mod √N, j) is now in (*, *, i, j). Also create the backward column ordering for the Bs.

Step 3: [Align the A blocks] Rearrange the blocks of A values obtained in Step 1 so that the block originally in group (i, (i + j) mod √N) (i.e., in processors (i, (i + j) mod √N, *, *)) is now in group (i, j). Also create the corresponding backward row ordering for the A blocks.

Step 4: [Align the B blocks] Rearrange the blocks of B values obtained in Step 2 so that the block originally in group ((i + j) mod √N, j) is now in group (i, j). Also create the corresponding backward column ordering for the B blocks.

Step 5: [Initialize block BC_ij] BC_ij = BA_ij x BB_ij.

Step 6: [Compute and add remaining terms]

Repeat √N - 1 times:

{  Shift A blocks left circularly by 1 group;

Shift B blocks up circularly by 1 group;

BC_ij = BC_ij + BA_ij x BB_ij;  }


Figure 5.15. GSM Matrix x Matrix Multiply

Table 5.1. Comparison between GRM and GSM schemes

                                   GRM                      GSM
Operation                 Electronic    OTIS       Electronic    OTIS
Column x Row              2√N           2          4(√N - 1)     2
Row x Column              2(√N - 1)     1          5(√N - 1)     2
Row x Matrix              4(√N - 1)     3          8(√N - 1)     2
Matrix x Column           4(√N - 1)     1          8(√N - 1)     2
Matrix x Matrix O(N)      4(N - 1)      N/K + 1    4(N - 1)      2√N/K + 1
Matrix x Matrix O(1)      8N + O(√N)    N + 1      4N + O(√N)    √N

for B blocks and 1 OTIS move. The block matrix multiply is done using Steps 2 and 3 of Cannon's algorithm at a cost of 4(√N - 1) electronic and (√N - 1) OTIS moves. The total number of moves made by the GSM algorithm is 4N + O(√N) electronic and √N OTIS.

The algorithm of Figure 5.15 is easily extended to the case when A and B are kN x kN matrices. The essential difference is that each element of a √N x √N block is now itself a k x k block. So, each electronic data move becomes k^2 unit moves. The number of data moves is therefore 4k^2 N + O(k^2 √N) electronic and √N OTIS.

5.3  Summary

We  have  developed  OTIS-Mesh  algorithms  for  several  variants  of  the  matrix 
multiplication  problem.  For  each  variant,  we  have  considered  both  the  group  row 
mapping  and  the  group  submatrix  mapping.  Our  results  are  summarized  in  Table  5.1. 
As  can  be  seen,  the  GSM  mapping  is  superior  for  the  case  of  matrix  x  matrix 
multiplication.  However,  for  all  other  variants  the  GRM  is  superior.  As  noted  in 
Section  2.3,  GRM  is  also  superior  for  the  matrix  transpose  operation. 


CHAPTER  6 
IMAGE  PROCESSING  ON  AN  OTIS-MESH 

In this chapter, we focus on four problems from the image processing area. These problems are histogramming, histogram modification, Hough transform, and image shrinking and expanding. As noted in Section 2.3, there are two plausible ways to map an N x N image onto an N^2 processor OTIS-Mesh: group row mapping (GRM) and group submesh mapping (GSM).

Our  histogramming  and  histogram  modification  algorithms  are  insensitive  to 
how  the  image  is  mapped  onto  the  OTIS-Mesh.  Therefore,  these  algorithms  are 
developed  without  regard  to  the  mapping  used.  The  algorithms  for  Hough  transform 
and  image  shrinking  and  expanding  depend  on  the  particular  mapping  used.  In 
Sections  6.3  and  6.4,  we  develop  algorithms  for  both  the  GRM  and  GSM  mappings. 

6.1  Histogramming

6.1.1  Background

The input to the histogramming problem is an N x N digitized image I with I(i, j) being the gray level of pixel (i, j), 0 ≤ i, j < N. The gray levels are integers in the range [0, B); that is, 0 ≤ I(i, j) < B, 0 ≤ i, j < N. The histogram of the image is a vector H such that H[b] is the number of pixels with gray value b, 0 ≤ b < B.

Parallel algorithms to compute the histogram of an image have been developed for many parallel architectures. For example, Siegel et al. [49] have developed a histogramming algorithm for a p processor PASM multicomputer, p < N^2, and Yasrebi et al. [57] have done this for the TRAC multicomputer. Grinberg et al. [10] have developed an algorithm to compute the histogram on an N^2 processor cellular machine called the 3-D machine; Tanimoto [51] has developed an O(B + log N) algorithm for a pyramid computer with an N x N base; Bestul and Davis [1] have developed an O(√B + log(N/B)) algorithm for an N^2 processor SIMD hypercube; and Jenq and Sahni [18] and Jang et al. [15] have developed algorithms for various reconfigurable mesh models.

The histogramming algorithms of Jang et al. [15] and Jenq and Sahni [18] partition B into the ranges 0 < B ≤ √N, √N < B ≤ N, and B > N and solve for each range separately. Further, where appropriate, they consider the cases O(1) memory per processor and O(B) memory per processor. We shall follow this strategy here also.

6.1.2  Algorithm for 0 < B ≤ √N

In this case, the histogram is left in row 0 of the group (0,0) mesh. Using our four-dimensional indexing scheme, processor (0,0,0,i) will have H[i], 0 ≤ i < B, following the histogram computation. Our strategy is (a) compute the histogram for each row of each √N x √N mesh, (b) use the row histograms to obtain the histogram for each group, and (c) combine the group histograms into a single histogram. More formally, our algorithm for the case 0 < B ≤ √N is:

Step 1: Processor (g_x, g_y, p_x, p_y) determines the number of pixels on its row of the group (g_x, g_y) mesh that have gray value p_y, 0 ≤ p_y < B.

Step 2: Processor (g_x, g_y, 0, p_y) adds up the values, along its column, that were computed in Step 1, 0 ≤ p_y < B.

Step 3: Perform an OTIS move on the values computed in Step 2.

Step 4: Processor (0, g_y, 0, 0) sums up all the values received in Step 3 by processors in group (0, g_y), 0 ≤ g_y < B.

Step 5: Perform an OTIS move on the results computed in Step 4.

Step 1 is accomplished by shifting the image values first leftward and then rightward within rows of the meshes. When an image value passes through processor (g_x, g_y, p_x, p_y), this processor increments its counter if the image value equals p_y. √N - 1 leftward and B - 1 rightward shifts, for a total of √N + B - 2 electronic moves, are needed. Step 2 is done by shifting the counts upwards along columns to row 0 of the group. This step takes √N - 1 electronic moves. In Step 4 we add all values in a √N x √N mesh. This takes 2(√N - 1) electronic moves. Therefore, histogramming can be done with 4(√N - 1) + B - 1 electronic and 2 OTIS moves.
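The five steps above can be checked with a small simulation over the four-dimensional index (g_x, g_y, p_x, p_y). The sketch below is an illustration under simplifying assumptions: the row shifts and column sums are modeled as direct sums rather than as individual electronic moves, `s` stands for √N, and the function name `otis_histogram` is hypothetical.

```python
from itertools import product


def otis_histogram(pixels, s, B):
    """pixels maps (gx, gy, px, py) to a gray value in range(B); requires B <= s."""
    idx = range(s)
    # Step 1: processor (gx, gy, px, py), py < B, counts the pixels on its row
    # of the group mesh whose gray value equals py.
    row_count = {
        (gx, gy, px, py): sum(1 for c in idx if pixels[gx, gy, px, c] == py)
        for gx, gy, px, py in product(idx, idx, idx, range(B))
    }
    # Step 2: processor (gx, gy, 0, py) adds the Step 1 counts along its column.
    group_count = {
        (gx, gy, 0, py): sum(row_count[gx, gy, px, py] for px in idx)
        for gx, gy, py in product(idx, idx, range(B))
    }
    # Step 3: OTIS move, (gx, gy, px, py) -> (px, py, gx, gy).
    moved = {(px, py, gx, gy): v for (gx, gy, px, py), v in group_count.items()}
    # Step 4: processor (0, g, 0, 0) sums everything received by group (0, g).
    total = {
        (0, g, 0, 0): sum(v for (a, b, _, _), v in moved.items() if (a, b) == (0, g))
        for g in range(B)
    }
    # Step 5: OTIS move; H[i] ends up in processor (0, 0, 0, i).
    final = {(px, py, gx, gy): v for (gx, gy, px, py), v in total.items()}
    return [final[0, 0, 0, i] for i in range(B)]
```

The returned list is read from processors (0,0,0,0) through (0,0,0,B-1), which is the output configuration stated in the text.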

Theorem 6.1.1 Our histogramming algorithm for 0 < B ≤ √N is optimal.

Proof We need to show that every OTIS-Mesh histogramming algorithm must make 4(√N - 1) + B - 1 electronic and 2 OTIS moves when 0 < B ≤ √N. To see this, consider an image in which I(0,0) = B - 1 and I(N - 1, N - 1) = 0. Since I(0,0) is mapped to processor (0,0,0,0) of the OTIS-Mesh and since H[B - 1] is left in processor (0,0,0,B - 1), it is necessary for the histogramming algorithm to move information from (0,0,0,0) to (0,0,0,B - 1), requiring at least B - 1 electronic moves that increase the row index (note that OTIS moves can only transpose indices, not change their value). Further, since I(N - 1, N - 1) is mapped to processor (√N - 1, √N - 1, √N - 1, √N - 1) and H[0] is left in (0,0,0,0), it is necessary to make at least √N - 1 electronic moves to decrease each of the four indices g_x, g_y, p_x, and p_y, a total of 4(√N - 1) electronic moves. Since moves that increase an index cannot be overlapped with those that decrease an index in the SIMD model, at least 4(√N - 1) + B - 1 electronic moves are necessary.

For the 2 OTIS moves, we see that information from all processors in group (√N - 1, √N - 1) (say) must get to group 0. Assume that group (√N - 1, √N - 1) has at least 2 gray values. To get information out of a group, an OTIS move must be made. A single OTIS move, however, moves data from different processors of a group into processors of different groups. Therefore, at least 2 OTIS moves are necessary to move different data from 2 or more different processors into a single other group. □

6.1.3  Algorithms for √N < B ≤ N

We first present an algorithm that uses O(1) memory per processor. Next, we present an optimal algorithm that uses O(√B) memory per processor. Our O(1) memory per processor algorithm leaves the histogram in the processors of group 0, one histogram value per processor. More specifically, processor (0, 0, p_x, p_y) contains H[p_x √N + p_y]. The algorithm is given below:

Step 1: Processor (p_x, p_y) of each group computes H[p_x √N + p_y] for the subimage in its group. This is done by sorting the gray values in a group using the integer sort algorithm of Krizanc [24]. During the sort, equal gray values are combined into a single gray value.

Step 2: Perform an OTIS move on the H values computed in Step 1.

Step 3: Processor (0,0) of each group sums the H values in its group that were received in Step 2. That is, processor (g_x, g_y, 0, 0) computes H[g_x √N + g_y] for the entire image.

Step 4: Processor (0,0) of each group performs an OTIS move on the sum computed in Step 3. Following this move, processor (0, 0, p_x, p_y) has H[p_x √N + p_y].

Step 1 takes 4√N + o(√N) electronic moves; Steps 2 and 4 take 1 OTIS move each; and Step 3 takes 2(√N - 1) electronic moves. The total number of moves is 6√N + o(√N) electronic moves and 2 OTIS moves.

Theorem 6.1.2 Every histogramming algorithm for the case √N < B ≤ N must make at least 5(√N - 1) + ⌊(B - 1)/√N⌋ - 1 electronic and 2 OTIS moves to compute the histogram configuration obtained by the O(1) memory algorithm. When the output configuration has the histogram in a √B x √B submesh of group (0,0), at least 4√N + 2√B - 6 electronic and 2 OTIS moves are needed.


Proof Consider the image in which the gray value of the pixel in processor (√N - 1, √N - 1, √N - 1, √N - 1) is 0 and the pixels in group (√N - 1, √N - 1) have at least 2 different gray values. Using the reasoning in Theorem 6.1.1, we see that at least 4(√N - 1) electronic and 2 OTIS moves are needed to get the histogram information from the group (√N - 1, √N - 1) processors to the target processors in group (0,0).

Next, suppose that the pixel in processor (0,0,0,0) has gray value v such that v mod √N = √N - 1 and ⌊v/√N⌋ = ⌊(B - 1)/√N⌋ - 1. It takes ⌊(B - 1)/√N⌋ - 1 + √N - 1 electronic moves to get information from (0,0,0,0) to (0, 0, ⌊(B - 1)/√N⌋ - 1, √N - 1) and these electronic moves cannot be overlapped with those used to move information from (√N - 1, √N - 1, √N - 1, √N - 1) to (0,0,0,0). Therefore, at least 5(√N - 1) + ⌊(B - 1)/√N⌋ - 1 electronic and 2 OTIS moves are needed to obtain the output configuration obtained by the O(1) memory algorithm.

For the √B x √B submesh output configuration, suppose that the gray value in processor (0,0,0,0) is B - 1. Therefore, information needs to flow from (0,0,0,0) to (0, 0, √B - 1, √B - 1), requiring 2(√B - 1) electronic moves that cannot be overlapped with the electronic moves made when moving information from (√N - 1, √N - 1, √N - 1, √N - 1) to (0,0,0,0). Thus a total of 4√N + 2√B - 6 electronic and 2 OTIS moves are necessary. □

An optimal histogramming algorithm for the √B x √B submesh output configuration is possible when O(√B) memory per processor is available. The algorithm given below adapts the method used in Jenq and Sahni [18], and assumes that B is a perfect square and that √B divides √N.

Step 1: Tile each √N x √N mesh by √B x √B tiles.

Step 2: Processor i on each row of each √B x √B tile computes an array A[0 : √B - 1] of values such that A[j] equals the number of pixels in that row of the tile whose gray value is j√B + i, 0 ≤ i < √B.

Step 3: Processors in the same column of each tile perform a consecutive sum operation; processor i of a tile column sums the A[i] values of the processors on its column. Following this step, processor (i, j) of a tile has the number of pixels in its tile whose gray value is i√B + j.

Step 4: Perform a window sum operation on the results of Step 3 using a window size √B x √B. This operation does not span group boundaries. The result of the window sum operation is in the top left √B x √B window/tile of each group. Following this operation, processor (g_x, g_y, i, j) has the number of pixels in group (g_x, g_y) whose gray value is i√B + j.

Step 5: Do an OTIS move on the values computed in Step 4.

Step 6: Processor (g_x, g_y, 0, 0) sums all the values received by its group.

Step 7: Do an OTIS move on the values computed by (g_x, g_y, 0, 0) in Step 6.

For the time complexity, we see that Steps 2 and 3 take 2(√B - 1) electronic moves each; Step 4 takes 2(√N - √B) electronic moves; Steps 5 and 7 take 1 OTIS move each; and Step 6 takes 2(√N - 1) electronic moves. The total number of moves is 4√N + 2√B - 6 electronic and 2 OTIS moves.
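The total above is the sum of the per-step counts, which matches the lower bound of Theorem 6.1.2 for the √B x √B output configuration. A short sketch that adds the stated per-step costs for concrete N and B is given below; the function name `submesh_histogram_moves` is an assumption for this illustration, and the code only does the arithmetic, it does not simulate the algorithm.

```python
import math


def submesh_histogram_moves(N, B):
    """Add up the per-step move counts quoted in the text for the O(sqrt(B)) memory algorithm."""
    rn, rb = math.isqrt(N), math.isqrt(B)
    # The algorithm assumes N and B are perfect squares and sqrt(B) divides sqrt(N).
    if rn * rn != N or rb * rb != B or rn % rb != 0:
        raise ValueError("N and B must be perfect squares with sqrt(B) dividing sqrt(N)")
    electronic = {
        "step 2": 2 * (rb - 1),   # row counts inside each tile
        "step 3": 2 * (rb - 1),   # consecutive sum along tile columns
        "step 4": 2 * (rn - rb),  # window sum within a group
        "step 6": 2 * (rn - 1),   # sum of all values in a group
    }
    otis = 2                      # Steps 5 and 7, one OTIS move each
    return sum(electronic.values()), otis
```

For N = 256 and B = 16 this gives 66 electronic moves, which equals 4√N + 2√B - 6 = 64 + 8 - 6.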
6.1.4  Algorithm for B > N

This case can be done with 22√N + O(N^{3/8}) electronic and O(N^{3/8}) OTIS moves by modifying the sort algorithm in Section 4.13 so that during the sort, pixels with the same gray value are combined into a single pixel.

6.2  Histogram Modification

Histogram modification is the process of changing the gray values of an image based on a mapping function f; f(i) gives the new gray value for pixels whose original gray value is i, 0 ≤ i < B. In histogram flattening or equalization [40], the function f is computed by first computing the prefix sums S[i] = Σ_{j=0}^{i} H[j], 0 ≤ i < B, of the histogram. Next, f(i) is obtained using one of the following equations

m  =  [SWB\,0<i<B, 

or 

/(i)«[Mg*=Uj,o<t<B 

where  S[-l]  =  0. 

In the OTIS implementation of histogram flattening, we explicitly consider only the case B = N. Other values of B may be handled similarly. The prefix sums may be computed from the histogram (which is in group 0) using 3(√N - 1) electronic moves (Section 4.3). To compute f using the first equation, no additional moves are needed. When the second definition is used, additional electronic moves are needed to shift the prefix sums by 1 processor.

Following the computation of f, the gray values of all pixels must be updated according to f. When we are limited to O(1) memory per processor, this updating of pixel values may be done by first performing a window broadcast of the f values to all groups. This broadcast can then be followed by a random access read (RAR) in which each processor obtains the needed f value from within its group. The window broadcast takes 2(√N - 1) electronic moves and the RAR takes 23√N + o(√N) electronic moves [54]. Thus, the updating phase takes 25√N + o(√N) electronic and 2 OTIS moves.

When O(√N) (= O(√B)) memory per processor is available, the updating of gray values may be done by first doing a window broadcast of the f values as was done in the O(1) memory case. Next, each processor accumulates the f values in the √N processors that are in its column. This accumulation is done in an array C. For a processor in column j of its group, C[i] = f(i√N + j), 0 ≤ i, j < √N. This accumulation step takes 2(√N - 1) electronic moves. Following the construction of the C arrays, each processor sends a token to the processor on its row that has the f value it needs. When the token reaches the target processor, the f value is written into the token, and the token is returned to the originating processor. This token send/receive step can be broken into two phases: one in which tokens are sent to and received from processors to the left of the source processors and another in which the target processors are to the right. Each of these phases takes 2(√N - 1) electronic moves. Thus, the O(√N) memory algorithm takes 8(√N - 1) electronic and 2 OTIS moves to update the gray values following the computation of f.
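The role of the C arrays can be made concrete: with C[i] = f(i√N + j) stored in column j, a pixel with gray value g finds f(g) in column g mod √N at index ⌊g/√N⌋, and that column is always reachable along the pixel's own row. The sketch below illustrates only this addressing, assuming B = N as in the text; the token movement is not simulated, and the names `build_c_arrays` and `update_row` are hypothetical.

```python
import math


def build_c_arrays(f, N):
    """Return C, where C[j][i] = f[i*sqrt(N) + j] is the array held in column j."""
    s = math.isqrt(N)
    return [[f[i * s + j] for i in range(s)] for j in range(s)]


def update_row(row_pixels, C):
    """Look up f(g) for each gray value g on one row of a group.

    The value f(g) lives in column g mod s at index g // s, so a token only
    ever has to travel within the row of the processor that issued it.
    """
    s = len(C)
    return [C[g % s][g // s] for g in row_pixels]
```

Here `f` is any list of length N giving the new gray value for each old one; the example mapping used for checking is arbitrary.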

The complexity of the O(√N) memory updating algorithm can be reduced to 6(√N - 1) electronic and 2 OTIS moves if the histogram computation phase saves, in processors 0 and √N - 1 of each row of a group, the gray values of all √N pixels in that row. This can be done without increasing the number of moves taken by the histogramming algorithm. When processors 0 and √N - 1 of each row know the gray values of all pixels in their row, the pixel values can be updated using the C arrays in 2(√N - 1) electronic moves rather than in 4(√N - 1) electronic moves. In the new method, processor 0 of each row initiates tokens t_{√N-1}, t_{√N-2}, ..., t_1 in that order. Token t_i contains the gray value of the pixel in processor i of the row. The tokens move rightward, one processor at a time. When a processor receives a token, it checks the token's gray value and appends the updated value for that gray value in case this updated value is stored in the processor's C array. The rightward token moves are made for exactly √N - 1 steps. Following this, processor √N - 1 of each row initiates tokens t_0, t_1, ..., t_{√N-2} that move leftward for exactly √N - 1 steps. At the end of these moves, each processor should have received a pair (t, f(t)) where t is the original gray value in the processor.

6.3  Shrinking and Expanding

6.3.1  Background

Let I be an N x N image and let B^q[i, j] represent the pixel block:

{[u, v] | 0 ≤ u, v < N, max{|u - i|, |v - j|} ≤ q}.

The q-step expansion, E^q, and the q-step shrinking, S^q, of I are given by the equations [43]:

E^q[i, j] = max{I[u, v] | [u, v] ∈ B^q[i, j]},  0 ≤ i, j < N
S^q[i, j] = min{I[u, v] | [u, v] ∈ B^q[i, j]},  0 ≤ i, j < N

Rosenfeld [43] has developed an O(k) algorithm to compute E^{2^k - 1} and S^{2^k - 1} at coarsely resampled points using a pyramid computer with an N x N base. Jenq and Sahni [19] develop an O(√q) time algorithm to compute E^q and S^q exactly on an N x N base pyramid. Ranka and Sahni [42] have developed an O(log N) time algorithm to compute E^q and S^q exactly using an N^2 processor hypercube computer. Jenq and Sahni have developed an O(1) time RMESH algorithm to compute E^q and S^q for binary images [16].

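Since image expansion and shrinking are computationally equivalent, we discuss
only image expansion explicitly. Following Ranka and Sahni [42], we compute
$E^q$ using the decomposition given below.

$$E^q[i,j] = \max\{\mathit{top}^q[i,j], \mathit{bottom}^q[i,j]\}, \text{ where}$$
$$\mathit{top}^q[i,j] = \max\{R^q[u,j] \mid 0 \le i-u \le q\}, \text{ and}$$
$$\mathit{bottom}^q[i,j] = \max\{R^q[u,j] \mid 0 \le u-i \le q\}.$$
$$R^q[i,j] = \max\{\mathit{left}^q[i,j], \mathit{right}^q[i,j]\}, \text{ where}$$
$$\mathit{left}^q[i,j] = \max\{I[i,v] \mid 0 \le j-v \le q\}, \text{ and}$$
$$\mathit{right}^q[i,j] = \max\{I[i,v] \mid 0 \le v-j \le q\}.$$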
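The decomposition can be checked on a small image with a sequential sketch. The code below is only an illustration, not the OTIS-Mesh algorithm: it assumes the closed block $|u-i| \le q$, $|v-j| \le q$ clipped at the image border, and the function names `expand_direct` and `expand_decomposed` are hypothetical.

```python
def expand_direct(img, q):
    """E^q by definition: max over the (2q+1) x (2q+1) block, clipped to the image."""
    n = len(img)
    return [[max(img[u][v]
                 for u in range(max(0, i - q), min(n, i + q + 1))
                 for v in range(max(0, j - q), min(n, j + q + 1)))
             for j in range(n)] for i in range(n)]


def expand_decomposed(img, q):
    """E^q via a row pass (left/right, giving R^q) and then a column pass (top/bottom)."""
    n = len(img)
    left = [[max(img[i][max(0, j - q): j + 1]) for j in range(n)] for i in range(n)]
    right = [[max(img[i][j: min(n, j + q + 1)]) for j in range(n)] for i in range(n)]
    r = [[max(left[i][j], right[i][j]) for j in range(n)] for i in range(n)]
    top = [[max(r[u][j] for u in range(max(0, i - q), i + 1))
            for j in range(n)] for i in range(n)]
    bottom = [[max(r[u][j] for u in range(i, min(n, i + q + 1)))
               for j in range(n)] for i in range(n)]
    return [[max(top[i][j], bottom[i][j]) for j in range(n)] for i in range(n)]


# A small deterministic test image (values chosen arbitrarily for illustration).
img = [[(7 * i + 3 * j) % 11 for j in range(6)] for i in range(6)]
```

Because the maximum over a square block equals the maximum over columns of the per-row maxima, both functions return the same array for every $q$.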


The algorithm used to compute $\mathit{right}^q$ depends on whether the image is
mapped onto the OTIS computer using the GRM or the GSM mapping.

6.3.2 GRM Mapping

In the GRM mapping, each row of the image is mapped onto an OTIS group in
snake-like order. A simple way to compute $\mathit{right}^q$ is to shift the gray values leftward
by $q$ units along the snake. A left shift by 1 takes 1 electronic left move, 1 electronic
right move, and 1 electronic up move. Therefore, $\mathit{right}^q$ can be computed using
a total of $3q$ electronic moves. We can overlap the left and right electronic moves
required to compute $\mathit{left}^q$ and $\mathit{right}^q$ so that the total number of moves in each of
the four mesh directions is $q$. That is, we can compute $\mathit{left}^q$ and $\mathit{right}^q$ using $4q$
electronic moves. To compute $\mathit{top}^q$ and $\mathit{bottom}^q$, we first do an OTIS move so that
each column of $R^q$ is in a single group in snake-like order; then we run the $\mathit{left}^q$ and
$\mathit{right}^q$ algorithm; and finally do another OTIS move to get the $E^q$ values back in the
proper locations. Thus $E^q$ can be computed using $8q$ electronic and 2 OTIS moves.
When $q > 1.25\sqrt{N}$, this simple strategy can be improved upon as below.

When $q > 1.25\sqrt{N}$, $\mathit{right}^q$ of a column 0 processor on an even indexed row
(within a group) depends on the $\sqrt{N}-1$ values to its right and on the same row,
all values on the following $q_f = \lfloor (q-\sqrt{N}+1)/\sqrt{N} \rfloor = \lfloor (q+1)/\sqrt{N} \rfloor - 1$ rows, and
$q_m = q - \sqrt{N} + 1 - q_f\sqrt{N} = (q+1) \bmod \sqrt{N}$ values in the row $q_f+1$ rows away
(Figure 6.1(a) and (b)). Likewise, $\mathit{right}^q$ of a column $\sqrt{N}-1$ processor on an odd
indexed row depends on the $\sqrt{N}-1$ values to its left and on the same row, all values
on the following $q_f$ rows, and $q_m$ values in the row $q_f+1$ rows away (Figure 6.1(c)
and (d)).

Also, $\mathit{right}^q$ of a column $i$ processor on an even indexed row, $i \ne 0$, depends
on the $\sqrt{N}-i-1$ values to its right and on the same row, all values in the following
$q_f$ rows, and an additional $q_m+i$ values (Figure 6.2(a) and (b)). Similarly, for a
column $i$ processor on an odd indexed row, $i \ne \sqrt{N}-1$, $\mathit{right}^q$ depends on the $i$
values to its left and on the same row, all values in the following $q_f$ rows, and an
additional $q_m+i$ values (Figure 6.2(c) and (d)).

When $q_m = 0$ or 1, the additional $q_m+i$ values lie on a single row for all
$i \in [0, \sqrt{N}-1]$. When $q_m = 0$ or 1, we use the following steps to compute $\mathit{right}^q$.

Step 1: Column 0 processors compute the maximum gray value in their row.

Step 2: Column 0 processors shift the maximum computed in Step 1 up column 0
for $q_f$ steps. Each column 0 processor computes the max of the $q_f$ values it
receives.

Step 3: Each processor shifts its image value up by $q_f+1$ rows. Following this step,
each row has the additional $\sqrt{N}$ values it needs from a row $q_f+1$ rows away.

Step 4: $\mathit{right}^q$ may now be computed by circulating the original image values in a
row, the additional values from a row $q_f+1$ away, and the max value of the
intermediate $q_f$ rows.

Figure 6.1. Data required in GRM mapping for end processor: (a) $q_f$ even; (b) $q_f$ odd; (c) $q_f$
even; (d) $q_f$ odd

Figure 6.2. Data required in GRM mapping for middle processor: (a) $q_f$ even; (b) $q_f$
odd; (c) $q_f$ even; (d) $q_f$ odd

Step 1 takes $\sqrt{N}-1$ leftward moves, Step 2 takes $q_f$ upward moves, and Step
3 takes $q_f+1$ upward moves. For Step 4, we note that in some rows, the original row
values are to be shifted left while in others, these are to be shifted right. Rows which
require a left shift of same row values require a right shift of the row $q_f+1$ away
values, and rows that require a right shift of same row values require a left shift of
row $q_f+1$ away values. So, the same row and row $q_f+1$ away values can be circulated
through the processors that need these values using $2(\sqrt{N}-1)$ electronic moves. The
rightward circulation of the max value of the intermediate rows can be done with one
additional move by pipelining it with the rightward moves for that row. The total
number of moves for the computation of $\mathit{right}^q$ is $3(\sqrt{N}-1)+2q_f+2 = 3\sqrt{N}+2q_f-1$
electronic and 0 OTIS. The computation of $\mathit{left}^q$ requires only $2(\sqrt{N}-1)+2q_f+2$
electronic moves because the row max values need not be recomputed. Thus the $\mathit{left}^q$
and $\mathit{right}^q$ values may be computed using a total of $5(\sqrt{N}-1)+4q_f+4$ electronic
moves. $\mathit{top}^q$ and $\mathit{bottom}^q$ can be computed similarly using $5(\sqrt{N}-1)+4q_f+4$
electronic and 2 OTIS moves. Thus the total number of moves to compute $E^q$ is
$10(\sqrt{N}-1)+8q_f+8$ electronic and 2 OTIS moves.

Figure 6.3. Data required in GSM mapping

When $q_m > 1$, the additional values needed to compute $\mathit{right}^q$ lie on two
rows: one is $q_f+1$ away and the other is $q_f+2$ away. The number of gray values
needed from a row $q_f+2$ away is $q_m-1$. These $q_m-1$ values can reach the row that
needs them if Step 3 of the $q_m \le 1$ algorithm shifts upwards for $q_f+2$ steps, rather
than $q_f+1$. So, only one additional upward move is needed. The row $q_f+2$ values
can be circulated in Step 4 by pipelining their movement along with that of the row
$q_f+1$ values. This increases the leftward and rightward moves by $q_m-1$ each. The
total move count for $\mathit{right}^q$ increases by $2q_m-1$ over that for the case $q_m \le 1$. The
new count for $E^q$ becomes $10(\sqrt{N}-1)+8q_f+8q_m+4$ electronic and 2 OTIS moves.
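The quantities $q_f$ and $q_m$ and the GRM move counts stated above can be tabulated with a short sketch. The helper names `split_q` and `grm_moves` are hypothetical, and `n` stands for $N$ (assumed to be a perfect square); the function only evaluates the formulas quoted in this section and does not simulate the data movement.

```python
import math


def split_q(q, n):
    """Return (q_f, q_m) for group side sqrt(N), following the definitions in the text."""
    s = math.isqrt(n)          # s = sqrt(N), assumed exact
    q_f = (q + 1) // s - 1     # number of full intermediate rows
    q_m = (q + 1) % s          # leftover values in the row q_f + 1 away
    return q_f, q_m


def grm_moves(q, n):
    """Electronic/OTIS move counts for E^q under the GRM mapping (formulas from the text)."""
    s = math.isqrt(n)
    q_f, q_m = split_q(q, n)
    if q_m <= 1:
        improved = 10 * (s - 1) + 8 * q_f + 8
    else:
        improved = 10 * (s - 1) + 8 * q_f + 8 * q_m + 4
    return {"simple": (8 * q, 2), "improved": (improved, 2)}
```

For example, with $N = 256$ and $q = 31$ the split is $q_f = 1$, $q_m = 0$, so the formulas give 166 electronic moves for the improved strategy against 248 for the simple one. The identity $q = (\sqrt{N}-1) + q_f\sqrt{N} + q_m$ is a quick consistency check on the split.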

6.3.3 GSM Mapping

The strategy to compute $E^q$ when a GSM mapping is used is similar to that
used for a GRM mapping. Notice that in a GSM mapping, a row of the image is
distributed over $\sqrt{N}$ groups with $\sqrt{N}$ row elements per group (Figure 6.3).

We consider three cases: (a) $q \le \sqrt{N}$, (b) $q_f = \lfloor (q-\sqrt{N}+1)/\sqrt{N} \rfloor =
\lfloor (q+1)/\sqrt{N} \rfloor - 1 \ne 0$ and $q_m = (q+1) \bmod \sqrt{N} = 0$, and (c) $q_f \ge 0$ and $q_m \ne 0$.
The case $q_f = 0$ and $q_m \le 1$ is included in (a).

When $q \le \sqrt{N}$, the gray values needed to compute $\mathit{right}^q$ for processor
$(g_x, g_y, p_x, p_y)$ are in $(g_x, g_y, p_x, *)$ and $(g_x, g_y+1, p_x, *)$. The steps to compute $\mathit{right}^q$
are:

Step 1: Perform the following sequence of moves on the gray values ($O$ denotes an
OTIS move and $E$ an electronic move):

$$(g_x, g_y+1, p_x, p_y) \xrightarrow{O} (p_x, p_y, g_x, g_y+1)$$
$$\xrightarrow{E} (p_x, p_y, g_x, g_y)$$
$$\xrightarrow{O} (g_x, g_y, p_x, p_y)$$

This moves gray values from $(g_x, g_y+1, p_x, p_y)$ to $(g_x, g_y, p_x, p_y)$. Now, each row
in a group has all the gray values needed to compute $\mathit{right}^q$ for all processors
in the group row.

Step 2: Shift original gray values leftwards within a group row by $q$ units.

Step 3: Shift gray values received in Step 1 rightwards by $\sqrt{N}-1$ units.

Steps 2 and 3 cause all data needed by a processor to go through the processor,
enabling the processor to compute its $\mathit{right}^q$ value. The total number of moves needed
for the case $q \le \sqrt{N}$ is $\sqrt{N}+q$ electronic and 2 OTIS moves. The moves needed to
compute $E^q$ become $4\sqrt{N}+4q$.
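The address arithmetic behind the Step 1 move sequence can be traced with a few lines of code. This is a hedged sketch: the helpers `otis`, `electronic`, and `fetch_from_next_group` are hypothetical names, addresses are plain 4-tuples $(g_x, g_y, p_x, p_y)$, and mesh boundaries are not checked.

```python
def otis(addr):
    """OTIS move: swap the group coordinates with the processor coordinates."""
    gx, gy, px, py = addr
    return (px, py, gx, gy)


def electronic(addr, dx=0, dy=0):
    """Electronic move inside a group: change the processor coordinates by (dx, dy)."""
    gx, gy, px, py = addr
    return (gx, gy, px + dx, py + dy)


def fetch_from_next_group(addr):
    """Follow a gray value through OTIS, one electronic move, OTIS (Step 1)."""
    a = otis(addr)               # (px, py, gx, gy + 1)
    a = electronic(a, dy=-1)     # (px, py, gx, gy)
    return otis(a)               # (gx, gy, px, py)
```

Starting from `(2, 4, 1, 0)`, that is $g_y + 1 = 4$, the value ends at `(2, 3, 1, 0)`: the same processor position in the group one step to the left, at a cost of one electronic and two OTIS moves.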

When $q_f \ne 0$ and $q_m = 0$, the $\mathit{right}^q$ value of $(g_x, g_y, p_x, p_y)$ is the max of
$I(g_x, g_y, p_x, p_y)$, the values to the right of $(g_x, g_y, p_x, p_y)$ and in the same row and
group, the max of the values in $(g_x, g_y+i, p_x, *)$, $1 \le i \le q_f$, and some of the values in
$(g_x, g_y+q_f+1, p_x, *)$ (but not $(g_x, g_y+q_f+1, p_x, \sqrt{N}-1)$). The steps to use are:

Step 1: Processor $(g_x, g_y, p_x, 0)$ determines the max of the gray values in $(g_x, g_y, p_x, *)$.

Step 2: Processor $(g_x, g_y, p_x, p_y)$ sends its gray value to $(g_x, g_y, p_x, p_y+1)$, $0 \le p_y <
\sqrt{N}-1$.

Step 3: Perform the following move sequence on the max values computed in Step 1:

$$(g_x, g_y, p_x, 0) \xrightarrow{O} (p_x, 0, g_x, g_y)$$
$$\xrightarrow{E} (p_x, 0, g_x, g_y-1)$$
$$\xrightarrow{E} (p_x, 0, g_x, g_y-2)$$
$$\vdots$$
$$\xrightarrow{E} (p_x, 0, g_x, g_y-q_f)$$
$$\xrightarrow{O} (g_x, g_y-q_f, p_x, 0)$$

During the electronic moves, each processor computes the max of the $q_f$ max
values that pass through it.

Step 4: Perform the following move sequence on the shifted values of Step 2. Note
that $p_y \ne 0$.

$$(g_x, g_y, p_x, p_y) \xrightarrow{O} (p_x, p_y, g_x, g_y)$$
$$\xrightarrow{E} (p_x, p_y, g_x, g_y-1)$$
$$\vdots$$
$$\xrightarrow{E} (p_x, p_y, g_x, g_y-q_f-1)$$
$$\xrightarrow{O} (g_x, g_y-q_f-1, p_x, p_y)$$

This moves the additional values that processors in group $(g_x, g_y)$ need from
group $(g_x, g_y+q_f+1)$.

Step 5: Shift the max of the max's in $(g_x, g_y, p_x, 0)$ (received in Step 3) and the
additional values received in Step 4 by $(g_x, g_y, p_x, p_y)$, $p_y \ne 0$, rightwards $\sqrt{N}-1$
times within a group row.

Step 6: Shift original values leftward $\sqrt{N}-1$ times within a group row.

Steps 5 and 6 cause all data needed by a processor to go through it and enable
the processor to compute $\mathit{right}^q$. Step 1 requires $\sqrt{N}-1$ electronic moves, Step 3
requires 2 OTIS and $q_f$ electronic moves, Step 4 requires 2 OTIS and $q_f+1$ electronic
moves, and Steps 5 and 6 each require $\sqrt{N}-1$ electronic moves. Note, however, that
all moves of Step 3 can be overlapped with moves of Step 4. Moreover, Step 6 can be
combined with Step 1. Therefore $\mathit{right}^q$ can be computed using $2\sqrt{N}+q_f$ electronic
and 2 OTIS moves.

When computing $\mathit{left}^q$, Step 1 can be omitted as the max values have already
been computed and Step 2 is not required either. However, Step 6 contributes $\sqrt{N}-1$
electronic moves now. So $\mathit{left}^q$ can be computed with an additional $2\sqrt{N}+q_f-1$
electronic and 2 OTIS moves. $\mathit{top}^q$ and $\mathit{bottom}^q$ can be computed similarly. The total
number of moves needed to compute $E^q$ is $8\sqrt{N}+4q_f-2$ electronic and 8 OTIS
moves.

The final case to consider is when $q_f \ge 0$ and $q_m \ne 0$. If $q_f = 0$, we may
assume $q_m > 1$ because the case $q_f = 0$ and $q_m \le 1$ is covered by the case $q \le \sqrt{N}$.
Now we need to shift values from two adjacent groups (Figure 6.4); $\sqrt{N}$ from the
group on the right and $q_m-1$ from the group two to the right. This shifting can be
done using the move sequence

$$(g_x, g_y, p_x, p_y) \xrightarrow{O} (p_x, p_y, g_x, g_y)$$
$$\xrightarrow{E} (p_x, p_y, g_x, g_y-1)$$
$$\xrightarrow{E} (p_x, p_y, g_x, g_y-2)$$
$$\xrightarrow{O} (g_x, [g_y-1, g_y-2], p_x, p_y)$$

Figure 6.4. Data required in GSM mapping when $q_f = 0$ and $q_m \ne 0$

In the last OTIS move, two values from each processor are moved. These are
the two values that pass through the processor during the two electronic moves stated
above. To compute $\mathit{right}^q$, we must now shift the original gray values leftward within
each group $\sqrt{N}-1$ times, shift the values from the right adjacent group leftward
$q_m-1$ times, and shift the values from the group two away rightward $\sqrt{N}-1$ times.
The computation of $\mathit{right}^q$ takes 2 OTIS and $2\sqrt{N}+q_m-1$ electronic moves. $\mathit{left}^q$,
$\mathit{top}^q$, and $\mathit{bottom}^q$ may be similarly computed. The total number of moves to compute
$E^q$ is therefore $8\sqrt{N}+4q_m-4$ electronic and 8 OTIS.

If $q_f > 0$, we must also compute the max of the values in the intermediate $q_f$
groups as was done for case (b). This adds $4q_f$ electronic and 0 OTIS moves to the
computation of $E^q$.

6.4 Hough Transform

6.4.1 Background

The Hough transform is used to detect straight lines or edges in an image.
The $p$ angle Hough transform [42] of an $N \times N$ image $I$ is a two-dimensional array
$H$ such that

$$H[r,j] = |\{(x,y) \mid r = \lfloor x\cos\theta_j + y\sin\theta_j \rfloor,\ \theta_j = \tfrac{\pi}{p}(j+1),\ \text{and } I[x,y] = 1\}|.$$

Here $j$ has the values $0, 1, \ldots, p-1$. These values of $j$ correspond to the $p$
angles $\theta_j = \frac{\pi}{p}(j+1)$. Since $\theta_j$ is in the range $[0, \pi]$ and since $0 \le x, y < N$, $r$ is in
the range $[-\sqrt{2}N, \sqrt{2}N]$.

Parallel algorithms to compute the Hough transform have been developed for
several architectures. Chaung and Li [3] and Li et al. [29] do this for systolic arrays;
Rosenfeld et al. [44], Kannan and Chaung [20], Cypher et al. [5], Guerra and Hambrusch
[11], and Silberberg [50] consider mesh computers; Fisher and Highnam [8] use a
scan line array; Ibrahim et al. [13] use a SIMD tree; Li et al. [27, 28] and Maresca et al. [32]
use a polymorphic torus; Ranka and Sahni [42] use hypercube computers; Choudhary
and Ponnusamy [4] and Thazhuthaveetil and Shah [52] use shared-memory multiprocessors;
Jenq and Sahni [17] use reconfigurable meshes; and Pavel and Akl [39] use
optical arrays. Illingworth and Kittler [14] provide a survey of work related to the
Hough transform.

6.4.2 An Improved Algorithm For $N \times N$ Meshes

Our development here is based on the work of Jenq and Sahni [17], which itself
is closely related to the work reported in several of the other references cited above.
The computation of $H$ is generally broken into four phases; each phase computes $H$
for a certain range of $\theta_j$. The four ranges are (a) $0 < \theta_j \le \pi/4$, (b) $\pi/4 < \theta_j \le \pi/2$,
(c) $\pi/2 < \theta_j \le 3\pi/4$, and (d) $3\pi/4 < \theta_j \le \pi$. Since the computation of $H$ in each
of these $\theta_j$ ranges is similar, the computation is explicitly described for only one
of the four ranges. As in Jenq and Sahni [17], we explicitly consider only the case
$\pi/2 < \theta_j \le 3\pi/4$.

Jenq and Sahni [17] have shown how to compute $H[r,j]$ for $r \ge 0$ and $j$ such
that $\pi/2 < \theta_j \le 3\pi/4$ on a two-dimensional mesh computer by starting tokens on two
boundaries of the mesh and successively moving these tokens towards the remaining
two boundaries. These tokens accumulate the $H[r,j]$ values. When we use the normal
convention of locating the origin of the coordinate system at the bottom-left corner
of the image and so at the bottom-left corner of the $N \times N$ mesh (see Figure 6.5),
tokens originate at the left and bottom boundaries of the image/mesh and move up
and right till they reach either the top or right boundary. For $\pi/2 < \theta_j \le 3\pi/4$, the
rules governing token movement are derived from the following facts. Here $\theta$ is the
angle and $r(x,y) = \lfloor x\cos\theta + y\sin\theta \rfloor$.

Figure 6.5. Coordinate system used in Hough Transform

(a) If $r(x,y) = r(x,y+k)$ for some $k > 0$, then $k = 1$.

(b) If $r(x,y) = r(x+1,c)$, then $c = y$ or $c = y+1$.

(c) If $r(x,y) = r(x,y+1) = r(x+1,c)$, then $c = y+1$.

(d) If $r(x,y) \ne r(z,y+1)$ for $z > x$, then $r(x,y) \ne r(z,w)$ for $w > y$.

Using these facts, we arrive at the following algorithm to compute $H[r,j]$ for
a single angle $\theta = \theta_j$.

Step 1: [ Create Tokens on Left and Bottom Boundaries ] Processor $(0,y)$ creates the
token $(sinv, cosv, r, n) = (\sin\theta, \cos\theta, r(0,y), I[0,y])$ provided $r(0,y) \ne r(0,y-1)$.
Processor $(x,0)$ creates the token $(\sin\theta, \cos\theta, r(x,0), I[x,0])$ provided $r(x,0) \ne
r(x-1,0)$.

Step 2: [ Move Tokens Up ] Let $(sinv, cosv, r, n)$ be the token (if any) in processor
$(x,y)$. If $r = r(x,y+1)$ or $r \ne r(x+1,y)$, move the token to $(x,y+1)$. If
$(x,y)$ receives a token $(sinv, cosv, r', n')$ in this step, it increments $n'$ by $I[x,y]$
provided $r' = r(x,y)$.

Step 3: [ Move Tokens Right ] All tokens move right from $(x,y)$ to $(x+1,y)$. If
$(x,y)$ receives a token $(sinv, cosv, r', n')$ in this step, it increments $n'$ by $I[x,y]$
provided $r' = r(x,y)$.

Step 4: Repeat Steps 2 and 3 until all tokens reach the top or right boundary.
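The four steps above can be traced with a sequential sketch that follows each token independently and compares the accumulated counts with a direct evaluation of the definition. This is an illustration under stated assumptions, not the mesh implementation: `img[x][y]` uses the bottom-left origin, a token is taken to be finished as soon as a prescribed move would leave the mesh (one reading of "reach the top or right boundary"), and the function names are hypothetical.

```python
import math


def hough_brute_force(img, theta):
    """Accumulate I[x][y] into bucket r(x, y) directly from the definition."""
    n = len(img)
    c, s = math.cos(theta), math.sin(theta)
    counts = {}
    for x in range(n):
        for y in range(n):
            r = math.floor(x * c + y * s)
            counts[r] = counts.get(r, 0) + img[x][y]
    return counts


def hough_single_angle(img, theta):
    """Simulate Steps 1-4 for one angle with pi/2 < theta <= 3*pi/4."""
    n = len(img)
    c, s = math.cos(theta), math.sin(theta)

    def r_of(x, y):
        return math.floor(x * c + y * s)

    # Step 1: create tokens (x, y, r, count) on the left and bottom boundaries.
    active = []
    for y in range(n):
        if r_of(0, y) != r_of(0, y - 1):
            active.append((0, y, r_of(0, y), img[0][y]))
    for x in range(n):
        if r_of(x, 0) != r_of(x - 1, 0):
            active.append((x, 0, r_of(x, 0), img[x][0]))

    finished = {}
    while active:                       # Step 4: repeat Steps 2 and 3
        survivors = []
        for x, y, r, count in active:
            # Step 2: move up if r = r(x, y+1) or r != r(x+1, y).
            if r == r_of(x, y + 1) or r != r_of(x + 1, y):
                y += 1
                if y >= n:              # token left the mesh: it is done
                    finished[r] = count
                    continue
                if r_of(x, y) == r:
                    count += img[x][y]
            # Step 3: every token moves right.
            x += 1
            if x >= n:
                finished[r] = count
                continue
            if r_of(x, y) == r:
                count += img[x][y]
            survivors.append((x, y, r, count))
        active = survivors
    return finished
```

For a binary image the two functions are expected to return the same dictionary, which is the property the correctness theorem that follows establishes: each token passes through every processor whose $r$ value matches its own.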

Theorem 6.4.1 The four step procedure given above is correct.

Proof To establish the correctness of this procedure, we must show that every
token $(sinv, cosv, r, n)$ visits all processors $(x,y)$ for which $r(x,y) = r$. Note that
when $\pi/2 < \theta \le 3\pi/4$, $-1/\sqrt{2} \le \cos\theta < 0$, $1/\sqrt{2} \le \sin\theta < 1$, and $\sin\theta + \cos\theta \ge 0$.
Consider the configuration following Step 1. If token $(sinv, cosv, r, n)$ is in $(0,y)$
then all $(x,z)$ for which $r(x,z) = r$ satisfy $x \ge 0$ and $z \ge y$. Further, if the token
is in $(x,0)$, then all $(z,y)$ for which $r(z,y) = r$ satisfy $z \ge x$ and $y \ge 0$. Therefore,
all unreached processors with the same $r$ value can be reached by making upward
and rightward moves alone. For any token $(sinv, cosv, r, n)$ in processor $(x,y)$, let
property $P$ be: all unreached processors $(u,v)$ with $r(u,v) = r$ have $u \ge x$ and
$v \ge y$ (i.e., these processors can be reached by making upward and rightward moves
alone). We have already shown that $P$ holds following Step 1. We shall show that this
property holds following each execution of Steps 2 and 3. Therefore the algorithm is
correct. To establish the result, we will also need to show that at the start of Step 2
$r(x,y) = r$ (so, we need not check $r' = r(x,y)$ in Step 3). The first time Step 2 is
initiated, $r(x,y) = r$ and so this condition holds. Call this condition $Q$. Assume, for
the induction hypothesis, that $P$ and $Q$ hold at the start of each execution of Step 2.

Consider a token that moves up in Step 2. Suppose that $r = r(x,y)$. If
$r = r(x,y+1)$ also, then $r = r(x,y) = r(x,y+1)$. From fact (c) it follows that
$r \ne r(x+1,y)$. From this and $\cos\theta < 0$, it follows that $r > r(x+1,y) \ge r(x+j,y)$,
$j \ge 2$. Therefore following the upward move of the token, property $P$ still holds for
the token. From fact (a), $r = r(x,y) = r(x,y+1)$, and $\sin\theta > 0$, it follows that
$r < r(x,y+j)$, $j \ge 2$. Therefore, following the rightward move of this token in
Step 3, property $P$ again holds. Following the Step 3 move of the token, the token
is in processor $(x+1,y+1)$ and $r(x+1,y+1) = \lfloor (x+1)\cos\theta + (y+1)\sin\theta \rfloor =
\lfloor x\cos\theta + y\sin\theta + \cos\theta + \sin\theta \rfloor \ge \lfloor x\cos\theta + y\sin\theta \rfloor = r$ (because $\cos\theta + \sin\theta \ge 0$).
This, together with the knowledge that $r = r(x,y+1) \ge r(x+1,y+1)$, implies that
$r(x+1,y+1) = r$. So condition $Q$ holds after Step 3.

Next, suppose that $r = r(x,y)$, $r \ne r(x,y+1)$, and $r \ne r(x+1,y)$. Again,
the token moves to $(x,y+1)$ in Step 2. Since $-1/\sqrt{2} \le \cos\theta < 0$, $r(x+1,y) =
r-1 \ge r(x+j,y)$, $j \ge 2$. Therefore moving the token up to $(x,y+1)$ preserves
property $P$. $P$ is also preserved following the rightward move made in Step 3 because
$r(x,y+j) \ge r(x,y+1) = r+1$, $j \ge 2$. Further, since $1/\sqrt{2} \le \sin\theta < 1$, $r(x,y+1) =
r+1$. Since $r(x+1,y+1) \in \{r(x,y+1), r(x,y+1)-1\} = \{r+1, r\}$ as well as in
$\{r(x+1,y), r(x+1,y)+1\} = \{r-1, r\}$, $r(x+1,y+1) = r$. Therefore $Q$ also holds
following Step 3.

If a token does not move up in Step 2, then $r = r(x,y) = r(x+1,y)$ at the start
and end of Step 2. Following Step 3, the token has moved to $(x+1,y)$ and so condition
$Q$ holds. Also since $r \ne r(x,y+1)$ and $\sin\theta \ge 1/\sqrt{2}$, $r < r(x,y+1) \le r(x,y+j)$,
$j \ge 2$. Therefore the right move of Step 3 preserves property $P$ also. □

Theorem 6.4.1 establishes the correctness of our $N \times N$ mesh single angle
Hough transform algorithm by showing that each token goes through every processor
with the same $r$ value as that of the token. For the complexity analysis, we first
observe that there can never be more than one token in any processor. To see this,
observe that when the algorithm completes Step 1, no processor has more than 1
token. If two tokens $t_1$ and $t_2$ end up in the same processor following Step 2, then $t_1$
must be in $(x,y)$ and $t_2$ in $(x,y+1)$ prior to the upward move of Step 2. Further, in
Step 2, $t_1$ must move to $(x,y+1)$ and $t_2$ must remain in $(x,y+1)$. From condition
$Q$ of Theorem 6.4.1, we know that the $r$ value $r_1$ of token $t_1$ must be $r(x,y)$ and
$r_2 = r(x,y+1)$. Since all tokens in the same column must originate in the same
column (because tokens move right at the same rate), $r_1 \ne r_2$ (alternatively, since all
tokens have different $r$ values, $r_1 \ne r_2$). Therefore, $r_2 = r(x,y+1) = r_1+1$. For $t_1$ to
move up and $t_2$ to remain in $(x,y+1)$, we must have $r(x+1,y) = r_1-1$ (this causes
$t_1$ to move up) and $r(x+1,y+1) = r_1+1$ (this causes $t_2$ to not move up). However,
$r(x+1,y)$ and $r(x+1,y+1)$ can differ by at most 1. Therefore, this condition is not
possible and so two tokens cannot be in the same processor following Step 2. Since
all tokens move right in Step 3, two tokens cannot be in the same processor after Step
3 unless they are in the same processor before Step 3. Therefore, it is not possible
for two tokens to ever be in the same processor.

Since tokens are always in different processors, the upward move of Step 2 can
be done in one time unit and the rightward move of Step 3 can be done in another
time unit. Since tokens can make at most $N-1$ right moves before reaching the right
boundary, at most $N-1$ iterations of Steps 2 and 3 are needed. Therefore, the single
angle Hough transform can be computed with $2(N-1)$ moves. This represents an
improvement over the algorithm of Jenq and Sahni [17] which takes $3(N-1)$ moves
to compute the single angle Hough transform.

To  compute  the  p  angle  transform,  we  modify  the  basic  one  angle  algorithm  so 
that  the  token  originating  in  (x,0)  is  created  only  when  the  tokens  that  originated  in 
(0,  y)  reach  column  x;  that  is,  the  (x,  0)  token  is  created  after  x  rightward  moves  have 
been  made  in  Step  3.  With  this  change,  all  tokens  are  always  in  the  same  column  and 
correctness  is  not  affected.  Further,  when  tokens  reach  the  top  or  right  boundary, 
they  start  moving  down  from  the  top  boundary  or  left  from  the  right  boundary.  This 
avoids  an  accumulation  of  multiple  tokens  in  the  same  boundary  processor. 

Since the modified algorithm uses only one column of the $N \times N$ mesh at any
time (excluding the backward movement of tokens from the top and right boundaries),
we can pipeline the computation for all $p/4$ angles in the range $\pi/2 < \theta \le 3\pi/4$. The
total computation takes $2(N-1) + 2(p/4-1)$ moves. The number of moves needed
for all $p$ angles is $4[2(N-1) + 2(p/4-1)] = 8(N-1) + 2p - 8$. A final sort step is
needed to create the array $H$ in the desired format. This takes an additional
$(4+\epsilon)N$ moves [38].

6.4.3 GRM Mapping

We can simulate the single angle Hough transform algorithm of Section 6.4.2
as well as the modified version of this algorithm on an $N^2$ processor OTIS-Mesh in
which the image has been mapped using the GRM mapping. Each execution of Step
2 can be done with 2 OTIS moves and 3 electronic moves, and each execution of Step
3 takes 3 electronic moves. The total number of moves for the single angle transform
becomes $6(N-1)$ electronic and $2(N-1)$ OTIS moves. To compute the transform
for all $p/4$ angles in the range $\pi/2 < \theta \le 3\pi/4$ takes $6(N-1) + 6(p/4-1)$ electronic
and $2(N-1) + 2(p/4-1)$ OTIS moves, and to compute the transform for all $p$ angles
takes $24(N-1) + 24(p/4-1)$ electronic and $8(N-1) + 8(p/4-1)$ OTIS moves,
exclusive of the final sort step.

6.4.4 GSM Mapping

When the GSM mapping is used we first compute the Hough transform for
each $\sqrt{N} \times \sqrt{N}$ mesh of the OTIS-Mesh (i.e., for each group of the OTIS-Mesh) and
then combine the results from all $N$ groups. The number of moves, exclusive of the
combining step, is $8(\sqrt{N}-1) + 8(p/4-1)$.

6.5 Summary

We have improved upon the $N \times N$ mesh Hough transform algorithms of [17]
and [5]. We are able to compute the $p$ angle Hough transform using $8(N-1) + 8(p/4-1)$
moves whereas the algorithm of [17] takes $12(N-1) + 8(p/4-1)$ moves and that
of [5] takes $48N + 20p + 4$ moves (exclusive of the final sort step). The $p$ angle Hough
transform algorithm takes $24(N-1) + 24(p/4-1)$ electronic and $8(N-1) + 8(p/4-1)$
OTIS moves when the GRM mapping is used, and $8(\sqrt{N}-1) + 8(p/4-1)$ electronic
moves when the GSM mapping is used (both exclusive of the sort/combine step).

The histogramming algorithm we developed takes $4(\sqrt{N}-1) + B - 1$ electronic
and 2 OTIS moves when $0 < B \le \sqrt{N}$, $22\sqrt{N} + O(N^{3/8})$ electronic and $O(N^{3/8})$
OTIS moves when $B \ge N$, and when $\sqrt{N} < B < N$, $6\sqrt{N} + o(\sqrt{N})$ electronic moves
and 2 OTIS moves for $O(1)$ memory per processor, and $4\sqrt{N} + 2\sqrt{B} - 6$ electronic
and 2 OTIS moves for $O(\sqrt{B})$ memory per processor.

For histogram modification, our algorithm for the $O(1)$ memory per processor case
takes $28\sqrt{N} + o(\sqrt{N})$ electronic and 2 OTIS moves, and that for the $O(\sqrt{N})$ memory
per processor case takes $9(\sqrt{N}-1)$ electronic and 2 OTIS moves.

Our algorithm for image shrinking and expanding takes $10(\sqrt{N}-1) + 8q_f +
8q_m + 4$ electronic and 2 OTIS moves when the GRM mapping is used and $8\sqrt{N} +
4q_f + 4q_m - 4$ electronic and 8 OTIS moves when the GSM mapping is used.


CHAPTER  7 
OTIS-HYPERCUBE 

The OTIS-Hypercube is another class of the OTIS computer in which the
electronic interconnect follows the hypercube paradigm. In an $N^2$ processor
OTIS-Hypercube, each group is a hypercube of dimension $\log_2 N$. Figure 7.1 shows a 16
processor OTIS-Hypercube. The number inside a processor is the processor index
within its group.

In  this  chapter,  we  explore  the  properties  of  the  OTIS-Hypercube.  We  also 
develop  algorithms  for  the  frequently  used  permutations  and  BPC  permutations  listed 
in  Chapter  3. 

7.1 OTIS-Hypercube Diameter

Let $N = 2^d$ and let $D(i,j)$ be the length of the shortest path from processor
$i$ to processor $j$ in a hypercube. Let $(g_1,p_1)$ and $(g_2,p_2)$ be two OTIS-Hypercube
processors. Similar to the discussion of the diameter of OTIS-Mesh in the previous
section, the shortest path between these two processors fits into one of the following
categories:

(a) The path employs electronic moves only. This is possible only when $g_1 = g_2$.

(b) The path employs an even number of OTIS moves. If the number of OTIS moves
is more than two, we may compress the path into a shorter path that uses 2
OTIS moves only ($E^*$ denotes a sequence of electronic moves and $O$ an OTIS
move): $(g_1,p_1) \xrightarrow{E^*} (g_1,p_2) \xrightarrow{O} (p_2,g_1) \xrightarrow{E^*} (p_2,g_2) \xrightarrow{O} (g_2,p_2)$.

(c) The path employs an odd number of OTIS moves. Again, if the number of
moves is more than one, we can compress the path into a shorter one that
employs exactly one OTIS move. The compressed path looks like:
$(g_1,p_1) \xrightarrow{E^*} (g_1,g_2) \xrightarrow{O} (g_2,g_1) \xrightarrow{E^*} (g_2,p_2)$.

Figure 7.1. 16 processor OTIS-Hypercube

Shortest paths of type (a) have length exactly $D(p_1,p_2)$ (which equals the
number of ones in the binary representation of $p_1 \oplus p_2$). Paths of type (b) and type
(c) have length $D(p_1,p_2) + D(g_1,g_2) + 2$ and $D(p_1,g_2) + D(p_2,g_1) + 1$, respectively.

The following theorem follows from the preceding discussion:

Theorem 7.1.1 The length of the shortest path between processors $(g_1,p_1)$ and $(g_2,p_2)$
is $D(p_1,p_2)$ when $g_1 = g_2$ and $\min\{D(p_1,p_2) + D(g_1,g_2) + 2,\ D(p_1,g_2) + D(p_2,g_1) + 1\}$
when $g_1 \ne g_2$.

Theorem 7.1.2 The diameter of the OTIS-Hypercube is $2d+1$.

Proof Since each group is a $d$-dimensional hypercube, $D(p_1,p_2)$, $D(g_1,g_2)$, $D(p_1,g_2)$,
and $D(p_2,g_1)$ are all less than or equal to $d$. From Theorem 7.1.1, we conclude that
no two processors are more than $2d+1$ apart. Now consider the processors $(g_1,p_1)$,
$(g_2,p_2)$ such that $p_1 = 0$ and $p_2 = N-1$. Let $g_1 = 0$ and $g_2 = N-1$. So
$D(p_1,p_2) = D(g_1,g_2) = D(p_1,g_2) = D(p_2,g_1) = d$. Hence, the distance between
$(g_1,p_1)$ and $(g_2,p_2)$ is $2d+1$. As a result, the diameter of the OTIS-Hypercube is exactly
$2d+1$. □
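Theorems 7.1.1 and 7.1.2 can be checked by brute force on small instances. The sketch below is an illustration only: it builds the OTIS-Hypercube implicitly (electronic links flip one bit of the processor index, and the OTIS link joins $(g,p)$ to $(p,g)$ when $g \ne p$), runs a breadth-first search from every processor, and compares each distance with the formula. The function names are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Hypercube distance D(a, b): number of ones in a XOR b."""
    return bin(a ^ b).count("1")


def formula_distance(g1, p1, g2, p2):
    """Shortest path length according to Theorem 7.1.1."""
    if g1 == g2:
        return hamming(p1, p2)
    return min(hamming(p1, p2) + hamming(g1, g2) + 2,
               hamming(p1, g2) + hamming(p2, g1) + 1)


def bfs_distances(d, source):
    """Breadth-first search over the OTIS-Hypercube with N = 2**d."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        g, p = queue.popleft()
        neighbours = [(g, p ^ (1 << k)) for k in range(d)]   # electronic links
        if g != p:
            neighbours.append((p, g))                        # OTIS link
        for nb in neighbours:
            if nb not in dist:
                dist[nb] = dist[(g, p)] + 1
                queue.append(nb)
    return dist


def check_and_get_diameter(d):
    """Assert BFS agrees with the formula everywhere; return the diameter."""
    n = 1 << d
    diameter = 0
    for g1 in range(n):
        for p1 in range(n):
            for (g2, p2), length in bfs_distances(d, (g1, p1)).items():
                assert length == formula_distance(g1, p1, g2, p2)
                diameter = max(diameter, length)
    return diameter
```

For $d = 1$ the four processors form a simple path of length 3, which matches $2d+1$; the same check can be run for slightly larger $d$ at small cost.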

1.2  Simulation  nf  an  N7  hvnercube 
Zane et al. [58] have shown that each move of an $N^2$ processor hypercube
can be simulated by either a single electronic move or by one electronic and two
OTIS moves in an $N^2$ processor OTIS-Hypercube. For the simulation, processor $q$
of the hypercube is mapped to processor $(g,p)$ of the OTIS-Hypercube. Here $gp = q$
(i.e., $g$ is obtained from the most significant $\log_2 N$ bits of $q$ and $p$ comes from
the least significant $\log_2 N$ bits). Let $g_{d-1} \cdots g_0$ and $p_{d-1} \cdots p_0$,
$d = \log_2 N$, be the binary representations of $g$ and $p$ respectively. The binary
representation of $q$ is $q_{2d-1} \cdots q_0 = g_{d-1} \cdots g_0 p_{d-1} \cdots p_0$. A
hypercube move moves data from processor $q$ to processor $q^{(k)}$ where $q^{(k)}$ is
obtained from $q$ by complementing bit $k$ in the binary representation of $q$. When
$k$ is in the range $[0, d)$, the move is done in the OTIS-Hypercube by a local
intragroup hypercube move. When $k \ge d$, the move is done using the steps


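$$
\begin{aligned}
(g_{d-1} \cdots g_j \cdots g_0\, p_{d-1} \cdots p_0)
&\xrightarrow{O} (p_{d-1} \cdots p_0\, g_{d-1} \cdots g_j \cdots g_0) \\
&\xrightarrow{E} (p_{d-1} \cdots p_0\, g_{d-1} \cdots \bar{g}_j \cdots g_0) \\
&\xrightarrow{O} (g_{d-1} \cdots \bar{g}_j \cdots g_0\, p_{d-1} \cdots p_0)
\end{aligned}
$$

where $j = k - d$.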
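
The simulation rule can be sketched in Python as follows. This is an illustrative address-level model, not code from the dissertation: the function name is hypothetical, an OTIS move is modeled as the transposition (g, p) to (p, g), and a processor with g equal to p is assumed to keep its data under that move.

```python
def simulate_hypercube_move(q, k, d):
    """Hops that carry data from hypercube processor q to q^(k).

    Returns a list of (move, (g, p)) pairs, where move is "E" for an
    electronic move and "O" for an OTIS move, and (g, p) is the
    OTIS-Hypercube processor reached after that move.
    """
    mask = (1 << d) - 1
    g, p = q >> d, q & mask          # q = gp: g is the high d bits, p the low d bits
    if k < d:
        # Bit k is a local bit: one intragroup hypercube move.
        return [("E", (g, p ^ (1 << k)))]
    j = k - d                        # bit k of q is bit j of g
    hops = []
    g, p = p, g                      # OTIS move: g becomes the local index
    hops.append(("O", (g, p)))
    p ^= 1 << j                      # electronic move complements g_j
    hops.append(("E", (g, p)))
    g, p = p, g                      # OTIS move restores the (group, local) order
    hops.append(("O", (g, p)))
    return hops
```

The last hop always lands on the processor whose concatenated index is q with bit k complemented, using one electronic move when k < d and one electronic plus two OTIS moves otherwise.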
Table 7.1. Optimal moves for $N^2 = 2^{2d}$ processor hypercube and respective
OTIS-Hypercube simulations

                      Optimal Hypercube Moves             Simulation
Permutation           total    group dim.   local dim.    OTIS     electronic
Transpose             2d       d            d             2d       2d
Perfect Shuffle       2d       d            d             2d       2d
Unshuffle             2d       d            d             2d       2d
Bit Reversal          2d       d            d             2d       2d
Vector Reversal       2d       d            d             2d       2d
Bit Shuffle           2d - 2   d - 1        d - 1         2d - 2   2d - 2
Shuffled Row-major    2d - 2   d - 1        d - 1         2d - 2   2d - 2
G_l P_u Swap          d        d/2          d/2           d        d

7.3    Common  Data  Rearrangements 

In this section, we concentrate on the realization of permutations such as
transpose, perfect shuffle, unshuffle, and vector reversal, which are frequently used
in applications. Nassimi and Sahni [36] have developed optimal hypercube algorithms
for these frequently used permutations. These algorithms may be simulated by an
OTIS-Hypercube using the method of Zane et al. [58] to obtain algorithms to realize
these data rearrangement patterns on an OTIS-Hypercube. Table 7.1 gives the number of
moves used by the optimal hypercube algorithms; a breakdown of the number of
moves in the group and local dimensions; and the number of electronic and OTIS
moves required by the simulation.

We shall obtain OTIS-Hypercube algorithms, for the permutations of
Table 7.1, that require far fewer moves than the simulations of the optimal hypercube
algorithms.

As mentioned before, each processor is indexed as (G, P) where G is the
group index and P the local index. An index pair (G, P) may be transformed into a
singleton index I = GP by concatenating the binary representations of G and P.
The permutations of Table 7.1 are members of the BPC (bit-permute-complement)
class of permutations defined in Nassimi and Sahni [36]. The definition of the BPC
permutation can be located in Section 3.8.2.

7.3.1    Transpose [p/2 - 1, ..., 0, p - 1, ..., p/2]

The transpose operation may be accomplished via a single OTIS move and
no electronic moves. The simulation of the optimal hypercube algorithm, however,
takes 2d OTIS and 2d electronic moves.

7.3.2    Perfect Shuffle [0, p - 1, p - 2, ..., 1]

We can adapt the strategy of Nassimi and Sahni [36] to an OTIS-Hypercube.
Each processor uses two variables A and B. Initially, all data are in the A variables
and the B variables have no data. The algorithm for perfect shuffle is given below:

Step 1: Swap A and B in processors with last two bits equal to 01 or 10.

Step 2: for (i = 1; i <= d - 1; i++) {

    (a) Swap the B variables of processors that differ on bit i only;

    (b) Swap the A and B variables of processors with bit i of their
        index I not equal to bit i + 1 of their index; }

Step 3: Perform an OTIS move on the A and B variables.

Step 4: for (i = 0; i <= d - 1; i++) {

    (a) Swap the B variables of processors that differ on bit i only;

    (b) Swap the A and B variables of processors with bit i of their
        index I not equal to bit i + 1 of their index; }

Step 5: Perform an OTIS move on the A and B variables.

Step 6: Swap the B variables of processors that differ on bit 0 only.

Step 7: Swap the A and B variables of processors with last two bits equal to 01 or
10.

Actually, in Step 1 it is sufficient to copy from A to B, and in Step 7 to copy
from B to A.

Table 7.2 shows the working of this algorithm on a 16 processor OTIS-Hypercube.
The correctness of the algorithm is easily established, and we see that the number of
data move steps is 2d + 2 (2d electronic moves and 2 OTIS moves; each OTIS move
moves two pieces of data from one processor to another, each electronic swap moves
a single datum between two processors).

The communication complexity of 2d + 2 is very close to optimal. For example,
data from the processor with index I = 0101...0101 is to move to the processor with
index I' = 1010...1010 and the distance between these two processors is 2d + 1.

Notice that the simulation of the optimal hypercube algorithm for perfect
shuffle takes 4d moves.
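
The seven steps can be traced with a short Python simulation. This is a sketch under stated assumptions rather than the authors' code: the names are illustrative, the loop of Step 2 is taken to run over i = 1, ..., d - 1 and the loop of Step 4 over i = 0, ..., d - 1 (which matches the 2d electronic moves counted above and the columns of Table 7.2), and an OTIS move is modeled as moving both variables of (G, P) to (P, G).

```python
def perfect_shuffle(d):
    """Run Steps 1-7 on a 2^(2d) processor OTIS-Hypercube.

    Returns (A, B, moves): the final A and B variables indexed by I = GP,
    and a count of electronic and OTIS data-move steps.
    """
    n = 1 << (2 * d)
    mask = (1 << d) - 1
    A = list(range(n))   # A[I] initially holds the datum that started at I
    B = [None] * n       # the B variables start empty
    moves = {"electronic": 0, "otis": 0}

    def bit(x, k):
        return (x >> k) & 1

    def swap_a_b(k):
        # Swap A and B wherever bit k of the index differs from bit k + 1.
        for i in range(n):
            if bit(i, k) != bit(i, k + 1):
                A[i], B[i] = B[i], A[i]

    def swap_b_along(k):
        # Swap the B variables of processors that differ on bit k only.
        for i in range(n):
            if bit(i, k) == 0:
                j = i | (1 << k)
                B[i], B[j] = B[j], B[i]
        moves["electronic"] += 1

    def otis_move():
        # Move both variables of (G, P) to (P, G).
        nonlocal A, B
        new_a, new_b = [None] * n, [None] * n
        for i in range(n):
            t = ((i & mask) << d) | (i >> d)
            new_a[t], new_b[t] = A[i], B[i]
        A, B = new_a, new_b
        moves["otis"] += 1

    swap_a_b(0)                 # Step 1
    for k in range(1, d):       # Step 2: i = 1, ..., d - 1
        swap_b_along(k)
        swap_a_b(k)
    otis_move()                 # Step 3
    for k in range(d):          # Step 4: i = 0, ..., d - 1
        swap_b_along(k)
        swap_a_b(k)
    otis_move()                 # Step 5
    swap_b_along(0)             # Step 6
    swap_a_b(0)                 # Step 7
    return A, B, moves
```

For d = 2, `perfect_shuffle(2)` leaves the A variables holding 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15 in processors 0000 through 1111, every B variable empty, and reports 4 electronic and 2 OTIS moves.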

7.3.3    Unshuffle [p - 2, p - 3, ..., 0, p - 1]

This is the inverse of a perfect shuffle and may be performed by running the
perfect shuffle algorithm mentioned above backwards (i.e., beginning with Step 7);
the for loops of Steps 2 and 4 are also run backwards. Thus the number of moves is
the same as for a perfect shuffle.

Table 7.2. Illustration of the perfect shuffle algorithm on a 16 processor
OTIS-Hypercube

Step:                 1     2(a)  2(b)  OTIS  4(a)  4(b)  4(a)  4(b)  OTIS  6     7
                            i=1   i=1         i=0   i=0   i=1   i=1
index  var.  initial
0000   A     0        0     0     0     0     0     0     0     0     0     0     0
       B     -        -     2     2     2     4     4     8     8     8     -     -
0001   A     1        -     -     -     6     6     2     2     2     -     -     8
       B     -        1     -     -     4     2     6     10    10    -     8     -
0010   A     2        -     -     -     8     8     12    12    4     -     -     1
       B     -        2     -     -     10    12    8     4     12    -     1     -
0011   A     3        3     3     1     14    14    14    14    6     9     9     9
       B     -        -     1     3     12    10    10    6     14    1     -     -
0100   A     4        4     4     6     -     -     -     -     -     2     2     2
       B     -        -     6     4     -     -     -     -     -     10    -     -
0101   A     5        -     -     -     -     -     -     -     -     -     -     10
       B     -        5     -     -     -     -     -     -     -     -     10    -
0110   A     6        -     -     -     -     -     -     -     -     -     -     3
       B     -        6     -     -     -     -     -     -     -     -     3     -
0111   A     7        7     7     7     -     -     -     -     -     11    11    11
       B     -        -     5     5     -     -     -     -     -     3     -     -
1000   A     8        8     8     8     -     -     -     -     -     4     4     4
       B     -        -     10    10    -     -     -     -     -     12    -     -
1001   A     9        -     -     -     -     -     -     -     -     -     -     12
       B     -        9     -     -     -     -     -     -     -     -     12    -
1010   A     10       -     -     -     -     -     -     -     -     -     -     5
       B     -        10    -     -     -     -     -     -     -     -     5     -
1011   A     11       11    11    9     -     -     -     -     -     13    13    13
       B     -        -     9     11    -     -     -     -     -     5     -     -
1100   A     12       12    12    14    1     1     1     1     9     6     6     6
       B     -        -     14    12    3     5     5     9     1     14    -     -
1101   A     13       -     -     -     7     7     3     3     11    -     -     14
       B     -        13    -     -     5     3     7     11    3     -     14    -
1110   A     14       -     -     -     9     9     13    13    13    -     -     7
       B     -        14    -     -     11    13    9     5     5     -     7     -
1111   A     15       15    15    15    15    15    15    15    15    15    15    15
       B     -        -     13    13    13    11    11    7     7     7     -     -

7.3.4    Bit Reversal [0, 1, ..., p - 1]

When simulating the optimal hypercube algorithm, the task requires 2d
electronic moves and 2d OTIS moves. But with the following algorithm:

Step 1: Do a local bit reversal in each group.
Step 2: Perform an OTIS move of all data.
Step 3: Do a local bit reversal in each group.

we can actually achieve the rearrangement in 2d electronic moves and 1 OTIS move,
since Steps 1 and 3 can be performed optimally in d electronic moves each [36].

The number of moves is optimal since the data from processor 0101...0101
is to move to processor 1010...1010, and the distance between these two processors
is 2d + 1 (Theorem 7.1.2).
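
At the level of processor addresses, the three steps compose into a full 2d-bit reversal, as the following Python sketch illustrates. The helper names are hypothetical; each local bit reversal is modeled as a single address transformation (the d electronic moves of the hypercube algorithm of [36] are not simulated), and the OTIS move is modeled as (g, p) to (p, g).

```python
def reverse_bits(x, width):
    """Reverse the order of the low `width` bits of x."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def otis_bit_reversal(g, p, d):
    """Destination (group, local) of the datum that starts at processor (g, p)."""
    p = reverse_bits(p, d)   # Step 1: local bit reversal inside the group
    g, p = p, g              # Step 2: OTIS move
    p = reverse_bits(p, d)   # Step 3: local bit reversal inside the new group
    return g, p
```

The datum starting at (g, p) ends at (reverse(p), reverse(g)), and concatenating these two fields gives the reversal of the 2d-bit index gp.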

7.3.5    Vector Reversal [-(p - 1), -(p - 2), ..., -0]

A  vector  reversal  can  be  done  using  2d  electronic  and  2  OTIS  moves.  The 
steps  are  as  follows: 

Step  1:  Perform  a  local  vector  reversal  in  each  group. 
Step  2:  Do  an  OTIS  move  of  all  data. 
Step  3:  Perform  a  local  vector  reversal  in  each  group. 
Step  4:  Do  an  OTIS  move  of  all  data. 

The  correctness  of  the  algorithm  is  obvious.  The  number  of  moves  is  computed 
using  the  fact  that  Steps  1  and  3  can  be  done  in  d  electronic  moves  each  [36]. 

Since  a  vector  reversal  requires  us  to  move  data  from  processor  00 ...  00  to 
processor  11...  11,  and  since  the  distance  between  these  two  processors  is  2d  +  1 
(Theorem  7.1.2),  our  vector  reversal  algorithm  can  be  improved  by  at  most  one 
move. 
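
The four steps can be checked at the address level with a few lines of Python. This is an illustrative sketch: the function name is hypothetical, a local vector reversal is modeled as complementing every bit of the local index, and the OTIS move is modeled as (g, p) to (p, g).

```python
def otis_vector_reversal(g, p, d):
    """Destination (group, local) of the datum that starts at processor (g, p)."""
    mask = (1 << d) - 1
    p ^= mask        # Step 1: local vector reversal (complement the local bits)
    g, p = p, g      # Step 2: OTIS move
    p ^= mask        # Step 3: local vector reversal on the former group bits
    g, p = p, g      # Step 4: OTIS move
    return g, p
```

The datum at index I = gp ends at index $N^2 - 1 - I$, that is, with all 2d bits complemented.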
7.3.6    Bit Shuffle [p - 1, p - 3, ..., 1, p - 2, p - 4, ..., 0]

Let $G = G_u G_l$ where $G_u$ and $G_l$ partition $G$ in half. Same for $P = P_u P_l$. Our
algorithm employs a $G_l P_u$ Swap permutation in which data from processor
$G_u G_l P_u P_l$ is routed to processor $G_u P_u G_l P_l$. So we need to first look at how
this permutation is performed.

$G_l P_u$ Swap [p - 1, ..., 3p/4, p/2 - 1, ..., p/4, 3p/4 - 1, ..., p/2, p/4 - 1, ..., 0]

The swap is performed by a series of bit exchanges of the form
$B(i) = [B_{p-1}, \ldots, B_0]$, $0 \le i < p/4$, where

$$
B_j = \begin{cases} p/2 + i, & j = p/4 + i \\ p/4 + i, & j = p/2 + i \\ j, & \text{otherwise.} \end{cases}
$$

Let $G(i)$ and $P(i)$ denote the $i$th bit of $G$ and $P$ respectively. So $G(0)$ is the
least significant bit in $G$, and $P(d-1)$ is the most significant bit in $P$. The bit
exchange $B(i)$ may be accomplished as below:

Step 1: Every processor $(G, P)$ with $G(i) \neq P(d/2 + i)$ moves its data to the
processor $(G, P')$ where $P'$ differs from $P$ only in bit $d/2 + i$.

Step 2: Perform an OTIS move on the data moved in Step 1.

Step 3: Processors $(G, P)$ that receive data in Step 2 move the received data to
$(G, P')$, where $P'$ differs from $P$ only in bit $i$.

Step 4: Perform an OTIS move on the data moved in Step 3.

The cost is 2 electronic moves and 2 OTIS moves.

To perform a $G_l P_u$ Swap permutation, we simply do $B(i)$ for $0 \le i < d/2$.
This takes d electronic moves and d OTIS moves. By doing pairs of bit exchanges
$(B(0), B(1))$, $(B(2), B(3))$, etc. together, we can reduce the number of OTIS moves
to d/2.
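
The bit exchange B(i) and the resulting swap can be followed for a single datum with the Python sketch below. The names are hypothetical, d is assumed to be even, the OTIS move is modeled as (g, p) to (p, g), and the pairing of exchanges that saves OTIS moves is not modeled.

```python
def bit_exchange(g, p, i, d):
    """Apply B(i) to the datum held by processor (g, p).

    Returns (new_g, new_p, moves), where moves lists the electronic ("E")
    and OTIS ("O") moves this datum takes.
    """
    moves = []
    if ((g >> i) & 1) != ((p >> (d // 2 + i)) & 1):
        p ^= 1 << (d // 2 + i)   # Step 1: flip bit d/2 + i of the local index
        moves.append("E")
        g, p = p, g              # Step 2: OTIS move
        moves.append("O")
        p ^= 1 << i              # Step 3: flip bit i (a former group bit)
        moves.append("E")
        g, p = p, g              # Step 4: OTIS move
        moves.append("O")
    return g, p, moves


def gl_pu_swap(g, p, d):
    """Route G_u G_l P_u P_l to G_u P_u G_l P_l by doing B(i) for 0 <= i < d/2."""
    for i in range(d // 2):
        g, p, _ = bit_exchange(g, p, i, d)
    return g, p
```

A datum whose two bits already agree does not move, and every other datum makes exactly two electronic and two OTIS moves per exchange, which is the cost stated above.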
Bit Shuffle

A bit shuffle, now, can be performed following these steps:

Step 1: Perform a $G_l P_u$ swap.
Step 2: Do a local bit shuffle in each group.
Step 3: Do an OTIS move.
Step 4: Do a local bit shuffle in each group.
Step 5: Do an OTIS move.

Steps 2 and 4 are done using the optimal d move hypercube bit shuffle
algorithm of Nassimi and Sahni [36]. The total number of data moves is 3d electronic
moves and d/2 + 2 OTIS moves.
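
That the five steps compose into a bit shuffle of the full 2d-bit index can be checked at the address level with the Python sketch below. It rests on assumptions that are not spelled out in this section: the bit shuffle vector is read as sending bit m of the upper half to position 2m + 1 and bit m of the lower half to position 2m, d is even, the $G_l P_u$ swap and the local bit shuffles are modeled as single address transformations, and the function names are illustrative.

```python
def bit_shuffle_index(x, width):
    """Interleave the two halves of a width-bit index.

    Bit m of the upper half goes to position 2m + 1 and bit m of the
    lower half goes to position 2m.
    """
    half = width // 2
    out = 0
    for m in range(half):
        out |= ((x >> m) & 1) << (2 * m)
        out |= ((x >> (half + m)) & 1) << (2 * m + 1)
    return out


def otis_bit_shuffle(g, p, d):
    """Destination (group, local) of the datum that starts at processor (g, p)."""
    half = d // 2
    low = (1 << half) - 1
    # Step 1: G_l P_u swap, G_u G_l P_u P_l -> G_u P_u G_l P_l
    g, p = ((g >> half) << half) | (p >> half), ((g & low) << half) | (p & low)
    p = bit_shuffle_index(p, d)   # Step 2: local bit shuffle
    g, p = p, g                   # Step 3: OTIS move
    p = bit_shuffle_index(p, d)   # Step 4: local bit shuffle
    g, p = p, g                   # Step 5: OTIS move
    return g, p
```

Under these assumptions the concatenated result equals `bit_shuffle_index` applied to the original 2d-bit index gp.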

7.3.7    Shuffled Row-major [p - 1, p/2 - 1, p - 2, p/2 - 2, ..., p/2, 0]

This  is  the  inverse  of  a  bit  shuffle  and  may  be  done  in  the  same  number  of 
moves  by  running  the  bit  shuffle  algorithm  backwards.  Of  course,  Steps  2  and  4  are 
to  be  changed  to  shuffled  row-major  operations. 


7.4    BPC Permutations

Every BPC permutation $A$ can be realized by a sequence of bit exchange
permutations of the form $B(i,j) = [B_{2d-1}, \ldots, B_0]$, $d \le i < 2d$, $0 \le j < d$, and

$$
B_q = \begin{cases} j, & q = i \\ i, & q = j \\ q, & \text{otherwise,} \end{cases}
$$

and a BPC permutation $C = [C_{2d-1}, \ldots, C_0] = \Pi_G \Pi_P$ where $|C_q| < d$,
$0 \le q < d$. $\Pi_G$ and $\Pi_P$ involve $d$ bits each.

For example, the transpose permutation may be realized by the sequence
$B(d + j, j)$, $0 \le j < d$; bit reversal is equivalent to the sequence
$B(2d - 1 - j, j)$, $0 \le j < d$; vector reversal can be realized by performing no bit
exchanges and using $C = [-(2d-1), -(2d-2), \ldots, -0]$
($\Pi_G = [-(2d-1), -(2d-2), \ldots, -d]$, $\Pi_P = [-(d-1), \ldots, -0]$); and perfect
shuffle may be decomposed into $B(d, 0)$ and
$C = [2d-2, 2d-3, \ldots, d, 2d-1, d-2, \ldots, 1, 0, d-1]$
($\Pi_G = [2d-2, 2d-3, \ldots, d, 2d-1]$, $\Pi_P = [d-2, \ldots, 1, 0, d-1]$).
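
The first two examples involve bit exchanges only, so they are easy to check with a small Python sketch. The helper names are hypothetical, and B(i, j) is modeled simply as swapping bits i and j of the 2d-bit index.

```python
def exchange_bits(x, i, j):
    """B(i, j): swap bits i and j of the index x."""
    if ((x >> i) & 1) != ((x >> j) & 1):
        x ^= (1 << i) | (1 << j)
    return x


def apply_exchanges(x, pairs):
    """Apply a sequence of bit exchanges B(i, j) to the index x."""
    for i, j in pairs:
        x = exchange_bits(x, i, j)
    return x


def transpose_pairs(d):
    """The sequence B(d + j, j), 0 <= j < d."""
    return [(d + j, j) for j in range(d)]


def bit_reversal_pairs(d):
    """The sequence B(2d - 1 - j, j), 0 <= j < d."""
    return [(2 * d - 1 - j, j) for j in range(d)]
```

Applying `transpose_pairs(d)` to the index gp yields pg, and applying `bit_reversal_pairs(d)` yields the index with its 2d bits in reverse order.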

A bit exchange permutation B(i,j) can be performed in 2 electronic moves and
2 OTIS moves using a process similar to that used for the bit exchange permutation
B(i). Notice that B(i) = B(i,i).

Our algorithm for general BPC permutations is:

Step 1: Decompose the BPC permutation $A$ into the pair cycle moves $B_1(i_1,j_1)$,
$B_2(i_2,j_2)$, ..., $B_k(i_k,j_k)$ and the BPC permutation $C = \Pi_G \Pi_P$ as above. Do
this such that $i_1 > i_2 > \cdots > i_k$ and $j_1 > j_2 > \cdots > j_k$.

Step 2: If $k = 0$, do the following:

    Step 2.1: Do the BPC permutation $\Pi_P$ in each group using the optimal
    algorithm of Nassimi and Sahni [36].

    Step 2.2: Do an OTIS move.

    Step 2.3: Do the BPC permutation $\Pi'_G$ in each group using the algorithm of
    Nassimi and Sahni [36].

    Step 2.4: Do an OTIS move.

Step 3: If $k = d$, do the following:

    Step 3.1: Do the BPC permutation $\Pi'_G$ in each group.

    Step 3.2: Do an OTIS move.

    Step 3.3: Do the BPC permutation $\Pi_P$ in each group.

Step 4: If $k \le d/2$, do the following:

    Step 4.1: Perform the bit exchange permutations $B_1, \ldots, B_k$.

    Step 4.2: Do Steps 2.1 through 2.4.

Step 5: If $k > d/2$, do the following:

    Step 5.1: Perform a sequence of $d - k$ bit exchanges involving bits other than
    those in $B_1, \ldots, B_k$ in the same orderly fashion described in Step 1.
    Recompute $\Pi_G$ and $\Pi_P$. Swap $\Pi_G$ and $\Pi_P$.

    Step 5.2: Do Steps 3.1 through 3.3.

The local BPC permutations determined by $\Pi_G$ and $\Pi_P$ take at most d
electronic moves each [36]; and the bit exchange permutations take at most d electronic
moves and d/2 OTIS moves. So the total number of moves is at most 3d electronic
moves and d/2 + 2 OTIS moves.

7.5  Comparison 

In this chapter we have shown that the diameter of the OTIS-Hypercube is
2d + 1, which is very close to that of an $N^2$ processor hypercube (2d). However, each
OTIS-Hypercube processor is connected to at most d + 1 other processors; while in an
$N^2$ processor hypercube, a processor is connected to 2d other processors. We have
also developed algorithms for frequently used data permutations. Table 7.3 lists the
performance comparisons between our algorithms and those obtained by simulating
the optimal hypercube algorithms using the simulation technique of Zane et al. [58].
For most of the permutations considered, our algorithms are either optimal or within
one move of being optimal.


Table 7.3. Complexity Comparison of Common Data Rearrangement

                      Simulation             Ours
Permutation           electronic   OTIS      electronic   OTIS
Transpose             2d           2d        0            1
Perfect Shuffle       2d           2d        2d           2
Unshuffle             2d           2d        2d           2
Bit Reversal          2d           2d        2d           1
Vector Reversal       2d           2d        2d           2
Bit Shuffle           2d - 2       2d - 2    3d           d/2 + 2
Shuffled Row-major    2d - 2       2d - 2    3d           d/2 + 2

CHAPTER  8 
CONCLUSION 

The OTIS computer is a relatively new architecture that contains both optical
and electronic connections. Optical interconnect is used to connect pairs of processors
that are on different groups. Electronic interconnect is used inside each group and
is flexible. The reason for the combination is that optical interconnect is superior to
electronic interconnect for long interconnects, but not for short (a few millimeters)
ones. Different classes of OTIS computers are obtained by using different topologies
to realize the electronic interconnect. This hybrid computer combines the best of both
worlds, can be easily realized, and has tremendous computing power. In the following
sections, we briefly summarize the results we have obtained in this dissertation, and
discuss some of the problems that remain to be explored.

8.1    Outline of the Results

We have reviewed the definition of the optical transpose interconnection
system (OTIS) and its construction via free-space optics (a pair of arrays of lenslets).
Different classes of OTIS computers can be obtained by using different electronic
interconnect topologies. Moreover, the OTIS computer can be utilized as a multistage
interconnection network (MIN) using only 2 OTIS connections.

The diameter of an $N^2$ processor OTIS-Mesh computer, in which the mesh
topology is used for the electronic interconnect, is $4\sqrt{N} - 3$. It is shown that the
OTIS-Mesh can be used to simulate a 2D-mesh using either the group row mapping (GRM)
or the group submesh mapping (GSM). OTIS-Mesh algorithms for the commonly used
permutations (transpose, perfect shuffle, unshuffle, bit reversal, vector reversal, bit
shuffle, and shuffled row-major) are developed. Among them transpose and bit
reversal are shown to be optimal. All of them are better than the 4D-mesh simulation.

The BPC permutation, of which all those commonly used permutations are
members, can realize a large variety of permutations. The OTIS-Mesh algorithm for
the general BPC permutation is thus developed.

A complete set of basic data manipulation operations (i.e., broadcast, window
broadcast, prefix sum, data sum, rank, shift, data accumulation, consecutive
sum, adjacent sum, concentrate, distribute, generalize, sorting, random access read
(RAR), and random access write (RAW)) are shown with their corresponding
OTIS-Mesh algorithms. Among them we show that the algorithms for broadcast, data sum,
concentrate, distribute, and generalize are optimal. The rest of them, besides sorting,
RAR, and RAW, are close to optimal. These algorithms can be used for a great
number of applications, like image processing, computational geometry, matrix algebra,
graph theory, and so forth.

We demonstrate how the various matrix multiplication operations (vector ×
vector, vector × matrix, matrix × vector, matrix × matrix) are accomplished. Since
the mapping of the matrix onto the OTIS-Mesh has a profound consequence on the
outcome, both GRM and GSM mapping schemes are included.

We also explore some problems related to image processing: histogramming
and histogram modification, Hough transform, and image shrinking and expanding.

Another class of OTIS computer, OTIS-Hypercube, is discussed, in which
the electronic interconnect follows the hypercube paradigm. The diameter of an $N^2$
processor OTIS-Hypercube is 2d + 1, where $d = \log_2 N$. OTIS-Hypercube algorithms
for commonly used permutations are developed, and they are either optimal or close
to optimal. Also, the OTIS-Hypercube algorithm for general BPC permutations is
presented.

The properties of OTIS computers are obtained mainly due to the
combination of the characteristics of both optical and electronic interconnects. Although
the technology is still evolving, the results obtained so far are promising. Further
investigations are encouraged as many open problems remain to be solved and many
questions to be answered.

Numerous parts of the dissertation are published and can be found in [46, 47,
55, 54, 53, 56, 48, 45].

8.2    Open Problems

Several questions related to this type of architecture remain open,
ranging from the underlying technology to algorithms:

•  So far we have counted electronic and OTIS moves separately, since at this
time it is difficult to see how the technology will progress. On the one hand,
the more transistors that can be put onto a chip, the better the electronic
interconnect is. On the other hand, optical technology is still evolving,
which could further increase the advantages optical interconnect has
over electronic interconnect. The combination of these trends could change the
partition into groups, and give preference to one kind of move over the other.

•  There are quite a few operations for which sub-optimal algorithms are obtained.
These algorithms are usually only a constant number of steps away from
optimal. There might be some way to make them optimal.

•  The deterministic sorting algorithm we presented does not greatly improve
on the simulated version. There could be a way to widen the gap
between them. Also, we only concern ourselves with general sorting.
There may be a simpler and/or faster algorithm for integer sorting.

•  We know that arbitrary permutations can be realized through sorting. But the
OTIS computer is shown to be capable of performing this task as a MIN. If the
whole permutation vector is known beforehand by each group, there could be a
way to perform the permutation faster than sorting. Further, if each processor
only knows its destination, there could be a better algorithm to accomplish the
task than sorting.

•  There might be other 2D-mesh mappings that reduce the slowdown
factor relative to GRM or GSM. Moreover, such a mapping could further
simplify and speed up the matrix multiplication operations.

•  There are still other interesting applications that are worth exploring, like
convex hull, FFT, etc.